https://github.com/batmen-lab/phylomix

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: batmen-lab
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 2.78 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

README: Mixup and Phylomix for Phylogeny Datasets

Overview

This project provides an implementation of data augmentation methods for phylogeny datasets, with a focus on leveraging tree-based relationships between features. The main method introduced is Phylomix, which combines phylogenetic information with compositional data for enhanced data augmentation. In addition, several baseline methods are implemented, including vanilla Mixup, Aitchison Mixup, and Cutmix variants.

The augmentation methods aim to improve model generalization by creating synthetic samples based on relationships between existing samples. These methods are particularly suitable for hierarchical and compositional datasets.


Features

  1. Phylomix: A novel method using phylogeny or taxonomy trees for Mixup.
  2. Vanilla Mixup: Traditional Mixup with linear interpolation.
  3. Aitchison Mixup: Mixup in Aitchison space, the geometry suited to compositional data.
  4. Compositional Cutmix: Variants of Cutmix adapted to phylogenetic trees.

Dataset

The dataset is available on Google Drive.

Our experiments include TADA augmentation; see the TADA repository, then add the augmented data files to the data folders.

For MB-GAN, see the MB-GAN repository. We provide a script that converts our dataset into the format MB-GAN requires:

```bash
python src/gan_data_prepare.py
```


Requirements

Required packages are listed in environment.yml. You can create a conda environment with the following command:

```bash
conda env create -f environment.yml
```


Classes and Methods

Mixup

The Mixup class encapsulates all Mixup and Cutmix augmentation methods.

Constructor

```python
Mixup(dataset, taxonomy_tree, phylogeny_tree, contrastive_learning=False)
```

Parameters:
  • dataset: A PhylogenyDataset instance containing data to augment.
  • taxonomy_tree: A PhylogenyTree instance for taxonomy.
  • phylogeny_tree: A PhylogenyTree instance for phylogeny.
  • contrastive_learning: (Optional) Enable contrastive learning mode (default: False).

Methods

mixup

Performs Mixup augmentation.

```python
mixup(num_samples, method, alpha, tree, min_threshold=None, max_threshold=None, index1=None, index2=None, contrastive_learning=False, seed=0)
```

Parameters:
  • num_samples: Number of Mixup samples to generate.
  • method: Mixup method (vanilla, aitchison, phylomix).
  • alpha: Beta distribution parameter for sample mixing.
  • tree: Type of tree to use (taxonomy or phylogeny).
  • Additional parameters for specific indices and thresholds.

Returns: Augmented PhylogenyDataset.

compositional_cutmix

Performs Cutmix augmentation based on compositional data.

```python
compositional_cutmix(num_samples, min_threshold=None, max_threshold=None)
```

intra_mixup

Performs Mixup within the same class.

```python
intra_mixup(min_threshold, max_threshold, method, num_samples, alpha)
```

intra_cutmix

Performs intra-class Cutmix by swapping subtrees.

```python
intra_cutmix(min_threshold, max_threshold, num_samples, height, num_subtrees)
```


Training

  • Specify arguments in an argument file.
  • To run supervised learning training:

```bash
bash run_job.sh args_file.txt supervised
```

  • To run contrastive learning training:

```bash
bash run_job.sh args_file.txt contrastive
```


Usage

1. Setting Up the Data

The setup_data function loads the dataset and the phylogeny tree from the specified file paths and prunes the tree's leaves to match the features in the data, so both are ready for augmentation.

```python
from mixup import setup_data

# File paths to your data and phylogeny tree
data_fp = 'path/to/your/datafile.tsv.xz'
meta_fp = 'path/to/your/metafile.tsv'
target_fp = 'path/to/your/targetfile.py'
phylogeny_tree_fp = 'path/to/your/phylogeny_tree.nwk'

# Initialize the dataset and phylogeny tree
data, tree = setup_data(data_fp, meta_fp, target_fp, phylogeny_tree_fp, prune=True)
```
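The README does not show how the pruning is done, so the following is only a minimal sketch of the idea: restrict a tree to the leaves that appear as data features. The nested-tuple tree representation and the helper names `prune` and `leaves` are assumptions made for this illustration and are not part of the repository; branch lengths are ignored.

```python
def prune(tree, keep):
    """Return `tree` restricted to the leaves in `keep`, or None if none remain.

    A tree is either a leaf name (str) or a tuple of child subtrees.
    (Hypothetical representation, chosen only for this sketch.)
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        # Collapse internal nodes that are left with a single child.
        return children[0]
    return tuple(children)


def leaves(tree):
    """List the leaf names of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = (("otu1", "otu2"), ("otu3", ("otu4", "otu5")))
features = {"otu1", "otu3", "otu4"}  # features present in the data table
pruned = prune(tree, features)
print(leaves(pruned))
```

After pruning, every remaining leaf corresponds to exactly one feature column, which is what tree-based mixing needs in order to map subtrees to feature indices.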

2. Augmenting the Data

Once you have your dataset and tree set up, you can use the augment function to apply a variety of augmentation techniques. This function allows you to easily apply mixup-based augmentations.

```python
from mixup import augment

augmented_data = augment(
    data=data,
    phylogeny_tree=tree,
    num_samples=3.0,  # Augment to 3 times the original number of samples
    alpha=2.0,
    aug_type='phylomix'
)
```

3. Applying Baseline Methods

Vanilla Mixup

Vanilla Mixup uses linear interpolation between two samples.

```python
augmented_dataset = mixup_instance.mixup(
    num_samples=100,
    method='vanilla',
    alpha=0.5,
    tree='taxonomy'
)
```
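As a rough illustration of the interpolation itself (not the repository's code), the sketch below mixes two samples and their one-hot labels with a weight drawn from a Beta(alpha, alpha) distribution. The function name `vanilla_mixup` and the toy vectors are assumptions made for this example.

```python
import random


def vanilla_mixup(x1, x2, y1, y2, alpha=0.5, rng=None):
    """Linearly interpolate two samples and their one-hot labels."""
    if rng is None:
        rng = random.Random(0)
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy relative-abundance vectors with one-hot labels.
x, y, lam = vanilla_mixup([0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [1, 0], [0, 1])
print(x, y, lam)
```

Because the result is a convex combination, two inputs that each sum to 1 produce a mixed sample that also sums to 1.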

Aitchison Mixup

Aitchison Mixup applies Mixup using the Aitchison geometry, which is specifically suited for compositional data.

```python
augmented_dataset = mixup_instance.mixup(
    num_samples=100,
    method='aitchison',
    alpha=0.5,
    tree='taxonomy'
)
```
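In Aitchison geometry, interpolating two compositions amounts to a weighted geometric mean followed by renormalization (closure). The sketch below shows that operation under stated assumptions: the names `closure` and `aitchison_mixup` are made up for this example, and the pseudocount used to avoid taking the log of zero is an illustrative choice, not necessarily how the repository handles zeros.

```python
import math


def closure(v):
    """Rescale a non-negative vector so that it sums to 1."""
    total = sum(v)
    return [a / total for a in v]


def aitchison_mixup(x1, x2, lam, pseudocount=1e-6):
    """Weighted geometric mean of two compositions, renormalized to sum to 1."""
    # The pseudocount is an assumption of this sketch; it keeps log() defined at zero.
    p1 = closure([a + pseudocount for a in x1])
    p2 = closure([b + pseudocount for b in x2])
    log_mix = [lam * math.log(a) + (1 - lam) * math.log(b) for a, b in zip(p1, p2)]
    return closure([math.exp(v) for v in log_mix])


x1 = [0.7, 0.2, 0.1]
x2 = [0.1, 0.3, 0.6]
mixed = aitchison_mixup(x1, x2, lam=0.5)
print(mixed)
```

Unlike linear interpolation, this mixing is multiplicative, so a feature that is near zero in either parent stays small in the mixed sample.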

Compositional Cutmix

Compositional Cutmix swaps data between samples in a way that respects compositional constraints.

```python
augmented_dataset = mixup_instance.compositional_cutmix(num_samples=50)
```
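One common way to swap features while keeping a valid composition is to pick each feature from one of the two parents at random and then renormalize. The sketch below follows that general recipe; it is an assumption about the technique, not the repository's implementation, and it does not model the min_threshold and max_threshold parameters. The standalone function name is chosen only for this example.

```python
import random


def compositional_cutmix(x1, x2, rng=None):
    """Take each feature from x1 or x2 at random, then renormalize to sum to 1."""
    if rng is None:
        rng = random.Random(0)
    lam = rng.random()  # probability of taking a feature from x1
    mixed = [a if rng.random() < lam else b for a, b in zip(x1, x2)]
    total = sum(mixed)
    if total == 0:
        # Every selected entry was zero; fall back to the first parent.
        return list(x1)
    return [v / total for v in mixed]


x1 = [0.5, 0.3, 0.2, 0.0]
x2 = [0.1, 0.1, 0.4, 0.4]
mixed = compositional_cutmix(x1, x2, rng=random.Random(1))
print(mixed)
```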

Intra-Class Mixup

This baseline performs Mixup within the same class, ensuring that generated samples are class-consistent.

```python
augmented_data = mixup_instance.intra_mixup(
    min_threshold=0.1,
    max_threshold=0.8,
    method='intra_aitchison',
    num_samples=150,
    alpha=0.6
)
```
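The class-consistency constraint comes down to how sample pairs are chosen: both parents must share a label, so the mixed sample can keep that label unchanged. A minimal sketch of such pair selection follows; the helper name `sample_intra_class_pairs` and the toy labels are hypothetical and not taken from the repository.

```python
import random
from collections import defaultdict


def sample_intra_class_pairs(labels, num_samples, rng=None):
    """Draw index pairs (i, j), i != j, whose samples share the same label."""
    if rng is None:
        rng = random.Random(0)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with at least two samples can form a pair.
    eligible = [idxs for idxs in by_class.values() if len(idxs) >= 2]
    if not eligible:
        raise ValueError("no class has two or more samples")
    pairs = []
    for _ in range(num_samples):
        idxs = rng.choice(eligible)
        i, j = rng.sample(idxs, 2)
        pairs.append((i, j))
    return pairs


labels = ["healthy", "disease", "healthy", "disease", "healthy"]
pairs = sample_intra_class_pairs(labels, num_samples=4)
print(pairs)
```

Each pair can then be passed to any mixing rule (linear or Aitchison), with the shared label assigned to the result.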

Intra-Class Cutmix

Intra-Class Cutmix swaps specific subtrees within the same class, respecting hierarchical structures.

```python
augmented_dataset = mixup_instance.intra_cutmix(
    min_threshold=0.1,
    max_threshold=0.5,
    num_samples=50,
    height=3,
    num_subtrees=2
)
```
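To make the subtree swap concrete, the sketch below replaces the features under randomly chosen subtrees of one sample with the corresponding features of another sample, then renormalizes. It is an illustration under assumptions: subtrees are given directly as lists of leaf (feature) indices, the function name `swap_subtrees` is made up, and the height and threshold parameters of intra_cutmix are not modeled.

```python
import random


def swap_subtrees(x1, x2, subtrees, num_subtrees=1, rng=None):
    """Copy x1, overwrite the leaves of randomly chosen subtrees with x2, renormalize.

    `subtrees` is a list of lists of leaf (feature) indices, one list per subtree.
    """
    if rng is None:
        rng = random.Random(0)
    chosen = rng.sample(subtrees, num_subtrees)
    mixed = list(x1)
    for leaf_indices in chosen:
        for j in leaf_indices:
            mixed[j] = x2[j]
    total = sum(mixed)
    return [v / total for v in mixed] if total > 0 else mixed


# Hypothetical tree over leaves 0..4 with two clades: (0, 1) and (2, 3, 4).
subtrees = [[0, 1], [2, 3, 4]]
x1 = [0.4, 0.1, 0.2, 0.2, 0.1]
x2 = [0.1, 0.1, 0.3, 0.3, 0.2]
mixed = swap_subtrees(x1, x2, subtrees, num_subtrees=1)
print(mixed)
```

Swapping whole clades keeps the relative proportions inside each clade intact, which is the sense in which the operation respects the tree.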


Examples

Example 1: Phylomix Augmentation

```python
augmented_data = mixup_instance.mixup(
    num_samples=200,
    method='phylomix',
    alpha=0.3,
    tree='phylogeny'
)
```

Example 2: Intra-Class Aitchison Mixup

```python
augmented_data = mixup_instance.intra_mixup(
    min_threshold=0.1,
    max_threshold=0.8,
    method='intra_aitchison',
    num_samples=150,
    alpha=0.6
)
```

Notes

  • Ensure that the taxonomy and phylogeny trees have the same leaves; otherwise, prune them.
  • Use contrastive learning mode for unsupervised tasks.
  • Experiment with alpha to control the degree of interpolation.
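To get a feel for what alpha does, assuming the mixing weight is drawn from Beta(alpha, alpha) as is standard for Mixup, the sketch below estimates how far the weight typically lies from 0.5. The helper name `mean_distance_from_half` is made up for this example.

```python
import random


def mean_distance_from_half(alpha, n=10_000, seed=0):
    """Average |lambda - 0.5| for lambda drawn from Beta(alpha, alpha)."""
    rng = random.Random(seed)
    return sum(abs(rng.betavariate(alpha, alpha) - 0.5) for _ in range(n)) / n


for alpha in (0.3, 1.0, 2.0):
    print(alpha, round(mean_distance_from_half(alpha), 3))
```

Small alpha values push the weight toward 0 or 1, so mixed samples stay close to one parent; larger values concentrate the weight near 0.5 and blend the parents more evenly.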

License

This project is open-source and licensed under the MIT License.

Owner

  • Name: BATMEN Lab @ UWaterloo
  • Login: batmen-lab
  • Kind: user
  • Company: UWaterloo CS

GitHub Events

Total
  • Watch event: 1
  • Push event: 1
Last Year
  • Watch event: 1
  • Push event: 1

Dependencies

environment.yml conda
  • _libgcc_mutex 0.1.*
  • _openmp_mutex 4.5.*
  • abseil-cpp 20230802.0.*
  • absl-py 1.4.0.*
  • aiohttp 3.9.0.*
  • aiosignal 1.2.0.*
  • anyio 3.5.0.*
  • argon2-cffi 23.1.0.*
  • argon2-cffi-bindings 21.2.0.*
  • arrow 1.2.3.*
  • asttokens 2.4.1.*
  • async-lru 2.0.4.*
  • attrs 23.1.0.*
  • babel 2.14.0.*
  • backoff 2.2.1.*
  • beautifulsoup4 4.12.2.*
  • biom-format 2.1.15.*
  • blas 1.0.*
  • bleach 6.1.0.*
  • blessed 1.20.0.*
  • blinker 1.6.2.*
  • boto3 1.29.1.*
  • botocore 1.32.1.*
  • bottleneck 1.3.5.*
  • brotli 1.0.9.*
  • brotli-bin 1.0.9.*
  • brotli-python 1.0.9.*
  • bzip2 1.0.8.*
  • c-ares 1.19.1.*
  • ca-certificates 2024.3.11.*
  • cachecontrol 0.12.11.*
  • cached-property 1.5.2.*
  • cached_property 1.5.2.*
  • cachetools 4.2.2.*
  • certifi 2024.2.2.*
  • cffi 1.16.0.*
  • charset-normalizer 2.0.4.*
  • cleo 2.0.1.*
  • click 8.1.7.*
  • colorama 0.4.6.*
  • comm 0.2.1.*
  • contourpy 1.2.0.*
  • crashtest 0.4.1.*
  • croniter 1.3.7.*
  • cryptography 41.0.7.*
  • cuda-cudart 12.1.105.*
  • cuda-cupti 12.1.105.*
  • cuda-libraries 12.1.0.*
  • cuda-nvrtc 12.1.105.*
  • cuda-nvtx 12.1.105.*
  • cuda-opencl 12.3.101.*
  • cuda-runtime 12.1.0.*
  • cycler 0.11.0.*
  • cython 3.0.7.*
  • dateutils 0.6.12.*
  • dbus 1.13.18.*
  • debugpy 1.8.0.*
  • decorator 5.1.1.*
  • deepdiff 6.7.1.*
  • defusedxml 0.7.1.*
  • dendropy 4.6.1.*
  • distlib 0.3.6.*
  • dulwich 0.21.3.*
  • entrypoints 0.4.*
  • exceptiongroup 1.2.0.*
  • executing 2.0.1.*
  • expat 2.5.0.*
  • fastapi 0.103.0.*
  • ffmpeg 4.3.*
  • filelock 3.13.1.*
  • fonttools 4.25.0.*
  • fqdn 1.5.1.*
  • freetype 2.12.1.*
  • frozenlist 1.4.0.*
  • fsspec 2023.10.0.*
  • giflib 5.2.1.*
  • glib 2.69.1.*
  • gmp 6.2.1.*
  • gmpy2 2.1.2.*
  • gnutls 3.6.15.*
  • google-auth 2.22.0.*
  • google-auth-oauthlib 0.5.2.*
  • grpc-cpp 1.48.2.*
  • grpcio 1.48.2.*
  • gtest 1.14.0.*
  • h11 0.12.0.*
  • h2 4.1.0.*
  • hdf5 1.12.2.*
  • hdmedians 0.14.2.*
  • hpack 4.0.0.*
  • html5lib 1.1.*
  • httpcore 0.15.0.*
  • httpx 0.25.1.*
  • hyperframe 6.0.1.*
  • idna 3.4.*
  • importlib-metadata 7.0.1.*
  • importlib_metadata 7.0.1.*
  • importlib_resources 6.3.1.*
  • iniconfig 2.0.0.*
  • inquirer 3.1.4.*
  • intel-openmp 2023.1.0.*
  • ipykernel 6.29.0.*
  • ipython 8.20.0.*
  • isoduration 20.11.0.*
  • itsdangerous 2.0.1.*
  • jaraco.classes 3.2.1.*
  • jedi 0.19.1.*
  • jeepney 0.7.1.*
  • jinja2 3.1.2.*
  • jmespath 1.0.1.*
  • joblib 1.2.0.*
  • jpeg 9e.*
  • json5 0.9.24.*
  • jsonpointer 2.4.*
  • jsonschema 4.19.2.*
  • jsonschema-specifications 2023.7.1.*
  • jsonschema-with-format-nongpl 4.19.2.*
  • jupyter-lsp 2.2.4.*
  • jupyter_client 7.4.9.*
  • jupyter_core 5.7.1.*
  • jupyter_events 0.10.0.*
  • jupyter_server 2.13.0.*
  • jupyter_server_terminals 0.5.3.*
  • jupyterlab 4.1.5.*
  • jupyterlab_pygments 0.3.0.*
  • jupyterlab_server 2.25.4.*
  • keyring 23.13.1.*
  • keyutils 1.6.1.*
  • kiwisolver 1.4.4.*
  • krb5 1.21.2.*
  • lame 3.100.*
  • lcms2 2.12.*
  • ld_impl_linux-64 2.38.*
  • lerc 3.0.*
  • libaec 1.1.2.*
  • libbrotlicommon 1.0.9.*
  • libbrotlidec 1.0.9.*
  • libbrotlienc 1.0.9.*
  • libcublas 12.1.0.26.*
  • libcufft 11.0.2.4.*
  • libcufile 1.8.1.2.*
  • libcurand 10.3.4.107.*
  • libcurl 8.4.0.*
  • libcusolver 11.4.4.55.*
  • libcusparse 12.0.2.55.*
  • libdeflate 1.17.*
  • libedit 3.1.20191231.*
  • libev 4.33.*
  • libffi 3.4.4.*
  • libgcc-ng 13.2.0.*
  • libgfortran-ng 11.2.0.*
  • libgfortran5 11.2.0.*
  • libgomp 13.2.0.*
  • libiconv 1.16.*
  • libidn2 2.3.4.*
  • libjpeg-turbo 2.0.0.*
  • libllvm14 14.0.6.*
  • libnghttp2 1.52.0.*
  • libnpp 12.0.2.50.*
  • libnvjitlink 12.1.105.*
  • libnvjpeg 12.1.1.14.*
  • libpng 1.6.39.*
  • libprotobuf 3.20.3.*
  • libsodium 1.0.18.*
  • libssh2 1.11.0.*
  • libstdcxx-ng 13.2.0.*
  • libtasn1 4.19.0.*
  • libtiff 4.5.1.*
  • libunistring 0.9.10.*
  • libuuid 1.41.5.*
  • libwebp 1.3.2.*
  • libwebp-base 1.3.2.*
  • libzlib 1.2.13.*
  • lightning 2.1.2.*
  • lightning-cloud 0.5.57.*
  • lightning-utilities 0.9.0.*
  • llvm-openmp 14.0.6.*
  • llvmlite 0.41.1.*
  • lockfile 0.12.2.*
  • lz4-c 1.9.4.*
  • markdown 3.4.1.*
  • markdown-it-py 2.2.0.*
  • markupsafe 2.1.3.*
  • matplotlib-base 3.8.0.*
  • matplotlib-inline 0.1.6.*
  • mdurl 0.1.0.*
  • mistune 3.0.2.*
  • mkl 2023.1.0.*
  • mkl-service 2.4.0.*
  • mkl_fft 1.3.8.*
  • mkl_random 1.2.4.*
  • more-itertools 10.1.0.*
  • mpc 1.1.0.*
  • mpfr 4.0.2.*
  • mpmath 1.3.0.*
  • msgpack-python 1.0.3.*
  • multidict 6.0.4.*
  • munkres 1.1.4.*
  • natsort 8.4.0.*
  • nbclient 0.10.0.*
  • nbconvert-core 7.16.2.*
  • nbformat 5.10.3.*
  • ncurses 6.4.*
  • nest-asyncio 1.5.9.*
  • nettle 3.7.3.*
  • networkx 3.1.*
  • notebook 7.1.2.*
  • notebook-shim 0.2.4.*
  • numba 0.58.1.*
  • numexpr 2.8.7.*
  • numpy 1.26.3.*
  • numpy-base 1.26.3.*
  • oauthlib 3.2.2.*
  • openh264 2.1.1.*
  • openjpeg 2.4.0.*
  • openssl 3.2.1.*
  • ordered-set 4.1.0.*
  • orjson 3.9.10.*
  • overrides 7.7.0.*
  • packaging 23.1.*
  • pandocfilters 1.5.0.*
  • parso 0.8.3.*
  • pcre 8.45.*
  • pexpect 4.8.0.*
  • pickleshare 0.7.5.*
  • pillow 10.0.1.*
  • pip 23.3.1.*
  • pkginfo 1.9.6.*
  • platformdirs 2.5.2.*
  • pluggy 1.3.0.*
  • poetry 1.4.0.*
  • poetry-core 1.5.1.*
  • poetry-plugin-export 1.3.0.*
  • prometheus_client 0.20.0.*
  • prompt-toolkit 3.0.42.*
  • protobuf 3.20.3.*
  • psutil 5.9.0.*
  • ptyprocess 0.7.0.*
  • pure_eval 0.2.2.*
  • pyasn1 0.4.8.*
  • pyasn1-modules 0.2.8.*
  • pycparser 2.21.*
  • pydantic 1.10.12.*
  • pygments 2.15.1.*
  • pyjwt 2.4.0.*
  • pynndescent 0.5.10.*
  • pyopenssl 23.2.0.*
  • pyparsing 3.0.9.*
  • pyproject_hooks 1.0.0.*
  • pysocks 1.7.1.*
  • pytest 7.4.4.*
  • python 3.11.5.*
  • python-build 0.10.0.*
  • python-dateutil 2.8.2.*
  • python-editor 1.0.4.*
  • python-fastjsonschema 2.19.1.*
  • python-installer 0.6.0.*
  • python-json-logger 2.0.7.*
  • python-multipart 0.0.6.*
  • python-tzdata 2023.3.*
  • python_abi 3.11.*
  • pytorch 2.1.2.*
  • pytorch-cuda 12.1.*
  • pytorch-lightning 2.0.3.*
  • pytorch-mutex 1.0.*
  • pytz 2023.3.post1.*
  • pyyaml 6.0.1.*
  • pyzmq 24.0.1.*
  • rapidfuzz 2.13.7.*
  • re2 2022.04.01.*
  • readchar 4.0.5.*
  • readline 8.2.*
  • referencing 0.30.2.*
  • requests 2.31.0.*
  • requests-oauthlib 1.3.0.*
  • requests-toolbelt 0.9.1.*
  • rfc3339-validator 0.1.4.*
  • rfc3986-validator 0.1.1.*
  • rich 13.3.5.*
  • rpds-py 0.10.6.*
  • rsa 4.7.2.*
  • s3transfer 0.7.0.*
  • scikit-bio 0.5.8.*
  • scikit-learn 1.2.2.*
  • scipy 1.11.4.*
  • secretstorage 3.3.1.*
  • send2trash 1.8.2.*
  • setuptools 68.2.2.*
  • shellingham 1.5.0.*
  • six 1.16.0.*
  • sniffio 1.2.0.*
  • soupsieve 2.5.*
  • sqlite 3.41.2.*
  • stack_data 0.6.2.*
  • starlette 0.27.0.*
  • starsessions 1.3.0.*
  • sympy 1.12.*
  • tbb 2021.8.0.*
  • tensorboard-data-server 0.7.0.*
  • tensorboard-plugin-wit 1.6.0.*
  • terminado 0.18.1.*
  • threadpoolctl 2.2.0.*
  • tinycss2 1.2.1.*
  • tk 8.6.12.*
  • tomli 2.0.1.*
  • tomlkit 0.11.1.*
  • torchaudio 2.1.2.*
  • torchmetrics 1.1.2.*
  • torchtriton 2.1.0.*
  • torchvision 0.16.2.*
  • tornado 6.3.3.*
  • tqdm 4.65.0.*
  • traitlets 5.7.1.*
  • trove-classifiers 2023.10.18.*
  • typing-extensions 4.9.0.*
  • typing_extensions 4.9.0.*
  • typing_utils 0.1.0.*
  • tzdata 2023d.*
  • umap-learn 0.5.3.*
  • uri-template 1.3.0.*
  • urllib3 1.26.18.*
  • uvicorn 0.20.0.*
  • virtualenv 20.17.1.*
  • wcwidth 0.2.5.*
  • webcolors 1.13.*
  • webencodings 0.5.1.*
  • websocket-client 0.58.0.*
  • websockets 10.4.*
  • werkzeug 2.2.3.*
  • wheel 0.41.2.*
  • xz 5.4.5.*
  • yaml 0.2.5.*
  • yarl 1.9.3.*
  • zeromq 4.3.5.*
  • zipp 3.17.0.*
  • zlib 1.2.13.*
  • zstd 1.5.5.*