https://github.com/batmen-lab/phylomix
Repository
Basic Info
- Host: GitHub
- Owner: batmen-lab
- Language: Jupyter Notebook
- Default Branch: main
- Size: 2.78 MB
Statistics
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
README: Mixup and Phylomix for Phylogeny Datasets
Overview
This project provides an implementation of data augmentation methods for phylogeny datasets, with a focus on leveraging tree-based relationships between features. The main method introduced is Phylomix, which combines phylogenetic information with compositional data for enhanced data augmentation. In addition, several baseline methods are implemented, including vanilla Mixup, Aitchison Mixup, and Cutmix variants.
The augmentation methods aim to improve model generalization by creating synthetic samples based on relationships between existing samples. These methods are particularly suitable for hierarchical and compositional datasets.
Features
- Phylomix: A novel method using phylogeny or taxonomy trees for Mixup.
- Vanilla Mixup: Traditional Mixup with linear interpolation.
- Aitchison Mixup: Mixup based on Aitchison geometry for compositional data.
- Compositional Cutmix: Variants of Cutmix adapted to phylogenetic trees.
Dataset
The dataset is available on Google Drive.
Our experiments include TADA augmentation; see the TADA repository, then add the augmented data files to the data folders.
For MB-GAN, see the MB-GAN repository. We provide a script that converts our dataset to the MB-GAN format. Run:
```bash
python src/gan_data_prepare.py
```
This prepares the data in the format required by MB-GAN.
Requirements
Required packages are listed in environment.yml. You can create a conda environment with the following command:
```bash
conda env create -f environment.yml
```
Classes and Methods
Mixup
The Mixup class encapsulates all Mixup and Cutmix augmentation methods.
Constructor
```python
Mixup(dataset, taxonomy_tree, phylogeny_tree, contrastive_learning=False)
```
Parameters:
- dataset: A PhylogenyDataset instance containing data to augment.
- taxonomy_tree: A PhylogenyTree instance for taxonomy.
- phylogeny_tree: A PhylogenyTree instance for phylogeny.
- contrastive_learning: (Optional) Enable contrastive learning mode (default: False).
Methods
mixup
Performs Mixup augmentation.
```python
mixup(num_samples, method, alpha, tree, min_threshold=None, max_threshold=None,
      index1=None, index2=None, contrastive_learning=False, seed=0)
```
Parameters:
- num_samples: Number of Mixup samples to generate.
- method: Mixup method (vanilla, aitchison, phylomix).
- alpha: Beta distribution parameter for sample mixing.
- tree: Type of tree to use (taxonomy or phylogeny).
- Additional parameters for specific indices and thresholds.
Returns: Augmented PhylogenyDataset.
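The effect of alpha can be illustrated with a short sketch. It assumes the mixing coefficient is drawn from Beta(alpha, alpha), the usual Mixup convention; whether this package uses exactly that parameterization is an assumption, and `sample_lambdas` and `mean_distance_from_half` are hypothetical helpers, not part of the package.

```python
import random


def sample_lambdas(alpha, n, seed=0):
    """Draw n mixing coefficients from Beta(alpha, alpha) (assumed convention)."""
    rng = random.Random(seed)
    return [rng.betavariate(alpha, alpha) for _ in range(n)]


def mean_distance_from_half(lams):
    """Average distance of the coefficients from an even 50/50 mix."""
    return sum(abs(lam - 0.5) for lam in lams) / len(lams)


# Small alpha pushes coefficients toward 0 or 1 (samples stay close to one parent);
# larger alpha concentrates them near 0.5 (stronger interpolation).
small = sample_lambdas(0.3, 5000)
large = sample_lambdas(2.0, 5000)
print(mean_distance_from_half(small), mean_distance_from_half(large))
```

Under this assumption, values of alpha below 1 produce augmented samples that mostly resemble one of the two parents, while values above 1 produce more evenly blended samples.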
compositional_cutmix
Performs Cutmix augmentation based on compositional data.
```python
compositional_cutmix(num_samples, min_threshold=None, max_threshold=None)
```
intra_mixup
Performs Mixup within the same class.
```python
intra_mixup(min_threshold, max_threshold, method, num_samples, alpha)
```
intra_cutmix
Performs intra-class Cutmix by swapping subtrees.
```python
intra_cutmix(min_threshold, max_threshold, num_samples, height, num_subtrees)
```
Training
- Specify arguments in an argument file.
- To run supervised learning training:
```bash
bash run_job.sh args_file.txt supervised
```
- To run contrastive learning training:
```bash
bash run_job.sh args_file.txt contrastive
```
Usage
1. Setting Up the Data
The setup_data function initializes the dataset and the phylogeny tree from the specified file paths, so that your data is loaded and ready for augmentation. The tree's leaves are pruned to match the data features.
```python
from mixup import setup_data

# File paths to your data and phylogeny tree
data_fp = 'path/to/your/datafile.tsv.xz'
meta_fp = 'path/to/your/metafile.tsv'
target_fp = 'path/to/your/targetfile.py'
phylogeny_tree_fp = 'path/to/your/phylogeny_tree.nwk'

# Initialize the dataset and phylogeny tree
data, tree = setup_data(data_fp, meta_fp, target_fp, phylogeny_tree_fp, prune=True)
```
2. Augmenting the Data
Once the dataset and tree are set up, use the augment function to apply mixup-based augmentations.
```python
from mixup import augment

augmented_data = augment(
    data=data,
    phylogeny_tree=tree,
    num_samples=3.0,  # Augment to 3 times the original number of samples
    alpha=2.0,
    aug_type='phylomix'
)
```
3. Applying Baseline Methods
Vanilla Mixup
Vanilla Mixup uses linear interpolation between two samples.
```python
augmented_dataset = mixup_instance.mixup(
    num_samples=100,
    method='vanilla',
    alpha=0.5,
    tree='taxonomy'
)
```
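The linear interpolation behind vanilla Mixup can be sketched on plain Python lists. This is a minimal illustration, not the package's implementation: `vanilla_mixup` is a hypothetical helper, the Beta(alpha, alpha) draw is the standard Mixup convention and is assumed here, and the toy vectors are made up.

```python
import random


def vanilla_mixup(x1, y1, x2, y2, alpha, rng):
    """Linearly interpolate two samples and their one-hot labels."""
    lam = rng.betavariate(alpha, alpha)  # assumed Beta(alpha, alpha) convention
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)
# Two toy relative-abundance vectors (each sums to 1) with one-hot labels.
x, y, lam = vanilla_mixup([0.2, 0.3, 0.5], [1.0, 0.0],
                          [0.6, 0.1, 0.3], [0.0, 1.0],
                          alpha=0.5, rng=rng)
print(lam, x, y)
```

Because the result is a convex combination, mixing two vectors that each sum to 1 yields a vector that also sums to 1, and the soft label carries weight lam for the first class.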
Aitchison Mixup
Aitchison Mixup applies Mixup using the Aitchison geometry, which is specifically suited for compositional data.
```python
augmented_dataset = mixup_instance.mixup(
    num_samples=100,
    method='aitchison',
    alpha=0.5,
    tree='taxonomy'
)
```
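For intuition, interpolation in Aitchison geometry amounts to a weighted geometric mean followed by renormalization (closure). The sketch below shows that operation on toy vectors; `closure` and `aitchison_mixup` are hypothetical helpers, and the pseudocount used to handle zeros is an assumption rather than something this README specifies.

```python
import math


def closure(x):
    """Rescale a non-negative vector so that it sums to 1."""
    total = sum(x)
    return [v / total for v in x]


def aitchison_mixup(x1, x2, lam, pseudocount=1e-6):
    """Mix two compositions as closure(x1**lam * x2**(1 - lam))."""
    # The pseudocount (an assumption) keeps log() defined when entries are zero.
    p1 = closure([v + pseudocount for v in x1])
    p2 = closure([v + pseudocount for v in x2])
    mixed = [math.exp(lam * math.log(a) + (1.0 - lam) * math.log(b))
             for a, b in zip(p1, p2)]
    return closure(mixed)


mixed = aitchison_mixup([0.2, 0.3, 0.5], [0.6, 0.1, 0.3], lam=0.5)
print(mixed)
```

Unlike linear interpolation, this mixes ratios between features multiplicatively, which is why it is a natural fit for relative-abundance data.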
Compositional Cutmix
Compositional Cutmix swaps data between samples in a way that respects compositional constraints.
```python
augmented_dataset = mixup_instance.compositional_cutmix(num_samples=50)
```
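One common way to swap data while keeping the result a valid composition is to pick each feature from one of the two samples and then renormalize. The sketch below illustrates that idea only; it is an assumption about the general technique, not a description of how this package's `compositional_cutmix` method works internally, and the standalone function here is a hypothetical helper.

```python
import random


def compositional_cutmix(x1, x2, lam, rng):
    """Take each feature from x1 with probability lam, else from x2, then renormalize."""
    mixed = [a if rng.random() < lam else b for a, b in zip(x1, x2)]
    total = sum(mixed)
    if total == 0:  # every selected entry was zero; fall back to x1
        return list(x1)
    return [v / total for v in mixed]


rng = random.Random(0)
x1 = [0.2, 0.3, 0.5]
x2 = [0.6, 0.1, 0.3]
mixed = compositional_cutmix(x1, x2, lam=0.5, rng=rng)
print(mixed)
```

The renormalization step is what keeps the output on the simplex, since the selected entries generally no longer sum to 1.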
Intra-Class Mixup
This baseline performs Mixup within the same class, ensuring that generated samples are class-consistent.
```python
augmented_data = mixup_instance.intra_mixup(
    min_threshold=0.1,
    max_threshold=0.8,
    method='intra_aitchison',
    num_samples=150,
    alpha=0.6
)
```
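Class consistency comes from restricting which samples may be paired. A minimal sketch of that pairing step follows, assuming labels are available as a simple list; `sample_intra_class_pairs` is a hypothetical helper and the labels are placeholders.

```python
import random
from collections import defaultdict


def sample_intra_class_pairs(labels, num_pairs, rng):
    """Return (i, j, label) index pairs whose two samples share a label."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with at least two samples can form a pair.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= 2]
    if not eligible:
        raise ValueError("no class has two or more samples")
    pairs = []
    for _ in range(num_pairs):
        c = rng.choice(eligible)
        i, j = rng.sample(by_class[c], 2)  # two distinct samples of class c
        pairs.append((i, j, c))
    return pairs


labels = ["A", "B", "A", "B", "A", "C"]  # placeholder class labels
pairs = sample_intra_class_pairs(labels, 4, random.Random(0))
print(pairs)
```

Each pair can then be mixed with any of the interpolation schemes, and the mixed sample simply inherits the shared label instead of a soft label.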
Intra-Class Cutmix
Intra-Class Cutmix swaps specific subtrees within the same class, respecting hierarchical structures.
```python
augmented_dataset = mixup_instance.intra_cutmix(
    min_threshold=0.1,
    max_threshold=0.5,
    num_samples=50,
    height=3,
    num_subtrees=2
)
```
Examples
Example 1: Phylomix Augmentation
```python
augmented_data = mixup_instance.mixup(
    num_samples=200,
    method='phylomix',
    alpha=0.3,
    tree='phylogeny'
)
```
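To give a feel for tree-guided mixing, the sketch below swaps whole clades from one sample into another and renormalizes. It is one plausible reading of subtree-based mixing, written under stated assumptions, and should not be taken as the Phylomix algorithm implemented in this repository: the tree encoding (a dict from internal node to children), the helper names, and the toy data are all hypothetical.

```python
import random


def leaves_under(tree, node):
    """Collect leaf names below node; tree maps internal node -> list of children."""
    if node not in tree:
        return [node]
    out = []
    for child in tree[node]:
        out.extend(leaves_under(tree, child))
    return out


def subtree_mix(x1, x2, tree, lam, rng):
    """Swap clades from x2 into x1 until about (1 - lam) of the leaves are replaced."""
    n_swap = round((1.0 - lam) * len(x1))
    internal = list(tree)
    rng.shuffle(internal)  # visit internal nodes in random order
    selected = []
    for node in internal:
        if len(selected) >= n_swap:
            break
        for leaf in leaves_under(tree, node):
            if leaf not in selected and len(selected) < n_swap:
                selected.append(leaf)
    mixed = {k: (x2[k] if k in selected else v) for k, v in x1.items()}
    total = sum(mixed.values())
    return {k: v / total for k, v in mixed.items()} if total > 0 else dict(x1)


# Toy tree with four leaves; samples are dicts keyed by leaf name.
tree = {"root": ["n1", "n2"], "n1": ["a", "b"], "n2": ["c", "d"]}
x1 = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2}
x2 = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
mixed = subtree_mix(x1, x2, tree, lam=0.5, rng=random.Random(0))
print(mixed)
```

The point of selecting leaves through internal nodes is that related features tend to be swapped together, in contrast to feature-wise schemes that ignore the tree.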
Example 2: Intra-Class Aitchison Mixup
```python
augmented_data = mixup_instance.intra_mixup(
    min_threshold=0.1,
    max_threshold=0.8,
    method='intra_aitchison',
    num_samples=150,
    alpha=0.6
)
```
Notes
- Ensure that the taxonomy and phylogeny trees have the same leaves; otherwise, prune them.
- Use contrastive learning mode for unsupervised tasks.
- Experiment with `alpha` to control the degree of interpolation.
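The leaf check in the first note can be sketched with plain sets, assuming the leaf names of both trees have already been extracted into lists. `compare_leaves` is a hypothetical helper and the names below are placeholders; how the trees are then pruned depends on the tree library in use.

```python
def compare_leaves(taxonomy_leaves, phylogeny_leaves):
    """Return (shared, only_in_taxonomy, only_in_phylogeny) as sorted lists."""
    tax, phy = set(taxonomy_leaves), set(phylogeny_leaves)
    return sorted(tax & phy), sorted(tax - phy), sorted(phy - tax)


shared, only_tax, only_phy = compare_leaves(
    ["otu1", "otu2", "otu3"],   # placeholder leaf names
    ["otu2", "otu3", "otu4"],
)
if only_tax or only_phy:
    # Both trees would need pruning down to the shared leaves.
    print("prune to:", shared)
```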
License
This project is open-source and licensed under the MIT License.
Owner
- Name: BATMEN Lab @ UWaterloo
- Login: batmen-lab
- Kind: user
- Company: UWaterloo CS
- Repositories: 7
- Profile: https://github.com/batmen-lab
GitHub Events
Total
- Watch event: 1
- Push event: 1
Last Year
- Watch event: 1
- Push event: 1
Dependencies
- _libgcc_mutex 0.1.*
- _openmp_mutex 4.5.*
- abseil-cpp 20230802.0.*
- absl-py 1.4.0.*
- aiohttp 3.9.0.*
- aiosignal 1.2.0.*
- anyio 3.5.0.*
- argon2-cffi 23.1.0.*
- argon2-cffi-bindings 21.2.0.*
- arrow 1.2.3.*
- asttokens 2.4.1.*
- async-lru 2.0.4.*
- attrs 23.1.0.*
- babel 2.14.0.*
- backoff 2.2.1.*
- beautifulsoup4 4.12.2.*
- biom-format 2.1.15.*
- blas 1.0.*
- bleach 6.1.0.*
- blessed 1.20.0.*
- blinker 1.6.2.*
- boto3 1.29.1.*
- botocore 1.32.1.*
- bottleneck 1.3.5.*
- brotli 1.0.9.*
- brotli-bin 1.0.9.*
- brotli-python 1.0.9.*
- bzip2 1.0.8.*
- c-ares 1.19.1.*
- ca-certificates 2024.3.11.*
- cachecontrol 0.12.11.*
- cached-property 1.5.2.*
- cached_property 1.5.2.*
- cachetools 4.2.2.*
- certifi 2024.2.2.*
- cffi 1.16.0.*
- charset-normalizer 2.0.4.*
- cleo 2.0.1.*
- click 8.1.7.*
- colorama 0.4.6.*
- comm 0.2.1.*
- contourpy 1.2.0.*
- crashtest 0.4.1.*
- croniter 1.3.7.*
- cryptography 41.0.7.*
- cuda-cudart 12.1.105.*
- cuda-cupti 12.1.105.*
- cuda-libraries 12.1.0.*
- cuda-nvrtc 12.1.105.*
- cuda-nvtx 12.1.105.*
- cuda-opencl 12.3.101.*
- cuda-runtime 12.1.0.*
- cycler 0.11.0.*
- cython 3.0.7.*
- dateutils 0.6.12.*
- dbus 1.13.18.*
- debugpy 1.8.0.*
- decorator 5.1.1.*
- deepdiff 6.7.1.*
- defusedxml 0.7.1.*
- dendropy 4.6.1.*
- distlib 0.3.6.*
- dulwich 0.21.3.*
- entrypoints 0.4.*
- exceptiongroup 1.2.0.*
- executing 2.0.1.*
- expat 2.5.0.*
- fastapi 0.103.0.*
- ffmpeg 4.3.*
- filelock 3.13.1.*
- fonttools 4.25.0.*
- fqdn 1.5.1.*
- freetype 2.12.1.*
- frozenlist 1.4.0.*
- fsspec 2023.10.0.*
- giflib 5.2.1.*
- glib 2.69.1.*
- gmp 6.2.1.*
- gmpy2 2.1.2.*
- gnutls 3.6.15.*
- google-auth 2.22.0.*
- google-auth-oauthlib 0.5.2.*
- grpc-cpp 1.48.2.*
- grpcio 1.48.2.*
- gtest 1.14.0.*
- h11 0.12.0.*
- h2 4.1.0.*
- hdf5 1.12.2.*
- hdmedians 0.14.2.*
- hpack 4.0.0.*
- html5lib 1.1.*
- httpcore 0.15.0.*
- httpx 0.25.1.*
- hyperframe 6.0.1.*
- idna 3.4.*
- importlib-metadata 7.0.1.*
- importlib_metadata 7.0.1.*
- importlib_resources 6.3.1.*
- iniconfig 2.0.0.*
- inquirer 3.1.4.*
- intel-openmp 2023.1.0.*
- ipykernel 6.29.0.*
- ipython 8.20.0.*
- isoduration 20.11.0.*
- itsdangerous 2.0.1.*
- jaraco.classes 3.2.1.*
- jedi 0.19.1.*
- jeepney 0.7.1.*
- jinja2 3.1.2.*
- jmespath 1.0.1.*
- joblib 1.2.0.*
- jpeg 9e.*
- json5 0.9.24.*
- jsonpointer 2.4.*
- jsonschema 4.19.2.*
- jsonschema-specifications 2023.7.1.*
- jsonschema-with-format-nongpl 4.19.2.*
- jupyter-lsp 2.2.4.*
- jupyter_client 7.4.9.*
- jupyter_core 5.7.1.*
- jupyter_events 0.10.0.*
- jupyter_server 2.13.0.*
- jupyter_server_terminals 0.5.3.*
- jupyterlab 4.1.5.*
- jupyterlab_pygments 0.3.0.*
- jupyterlab_server 2.25.4.*
- keyring 23.13.1.*
- keyutils 1.6.1.*
- kiwisolver 1.4.4.*
- krb5 1.21.2.*
- lame 3.100.*
- lcms2 2.12.*
- ld_impl_linux-64 2.38.*
- lerc 3.0.*
- libaec 1.1.2.*
- libbrotlicommon 1.0.9.*
- libbrotlidec 1.0.9.*
- libbrotlienc 1.0.9.*
- libcublas 12.1.0.26.*
- libcufft 11.0.2.4.*
- libcufile 1.8.1.2.*
- libcurand 10.3.4.107.*
- libcurl 8.4.0.*
- libcusolver 11.4.4.55.*
- libcusparse 12.0.2.55.*
- libdeflate 1.17.*
- libedit 3.1.20191231.*
- libev 4.33.*
- libffi 3.4.4.*
- libgcc-ng 13.2.0.*
- libgfortran-ng 11.2.0.*
- libgfortran5 11.2.0.*
- libgomp 13.2.0.*
- libiconv 1.16.*
- libidn2 2.3.4.*
- libjpeg-turbo 2.0.0.*
- libllvm14 14.0.6.*
- libnghttp2 1.52.0.*
- libnpp 12.0.2.50.*
- libnvjitlink 12.1.105.*
- libnvjpeg 12.1.1.14.*
- libpng 1.6.39.*
- libprotobuf 3.20.3.*
- libsodium 1.0.18.*
- libssh2 1.11.0.*
- libstdcxx-ng 13.2.0.*
- libtasn1 4.19.0.*
- libtiff 4.5.1.*
- libunistring 0.9.10.*
- libuuid 1.41.5.*
- libwebp 1.3.2.*
- libwebp-base 1.3.2.*
- libzlib 1.2.13.*
- lightning 2.1.2.*
- lightning-cloud 0.5.57.*
- lightning-utilities 0.9.0.*
- llvm-openmp 14.0.6.*
- llvmlite 0.41.1.*
- lockfile 0.12.2.*
- lz4-c 1.9.4.*
- markdown 3.4.1.*
- markdown-it-py 2.2.0.*
- markupsafe 2.1.3.*
- matplotlib-base 3.8.0.*
- matplotlib-inline 0.1.6.*
- mdurl 0.1.0.*
- mistune 3.0.2.*
- mkl 2023.1.0.*
- mkl-service 2.4.0.*
- mkl_fft 1.3.8.*
- mkl_random 1.2.4.*
- more-itertools 10.1.0.*
- mpc 1.1.0.*
- mpfr 4.0.2.*
- mpmath 1.3.0.*
- msgpack-python 1.0.3.*
- multidict 6.0.4.*
- munkres 1.1.4.*
- natsort 8.4.0.*
- nbclient 0.10.0.*
- nbconvert-core 7.16.2.*
- nbformat 5.10.3.*
- ncurses 6.4.*
- nest-asyncio 1.5.9.*
- nettle 3.7.3.*
- networkx 3.1.*
- notebook 7.1.2.*
- notebook-shim 0.2.4.*
- numba 0.58.1.*
- numexpr 2.8.7.*
- numpy 1.26.3.*
- numpy-base 1.26.3.*
- oauthlib 3.2.2.*
- openh264 2.1.1.*
- openjpeg 2.4.0.*
- openssl 3.2.1.*
- ordered-set 4.1.0.*
- orjson 3.9.10.*
- overrides 7.7.0.*
- packaging 23.1.*
- pandocfilters 1.5.0.*
- parso 0.8.3.*
- pcre 8.45.*
- pexpect 4.8.0.*
- pickleshare 0.7.5.*
- pillow 10.0.1.*
- pip 23.3.1.*
- pkginfo 1.9.6.*
- platformdirs 2.5.2.*
- pluggy 1.3.0.*
- poetry 1.4.0.*
- poetry-core 1.5.1.*
- poetry-plugin-export 1.3.0.*
- prometheus_client 0.20.0.*
- prompt-toolkit 3.0.42.*
- protobuf 3.20.3.*
- psutil 5.9.0.*
- ptyprocess 0.7.0.*
- pure_eval 0.2.2.*
- pyasn1 0.4.8.*
- pyasn1-modules 0.2.8.*
- pycparser 2.21.*
- pydantic 1.10.12.*
- pygments 2.15.1.*
- pyjwt 2.4.0.*
- pynndescent 0.5.10.*
- pyopenssl 23.2.0.*
- pyparsing 3.0.9.*
- pyproject_hooks 1.0.0.*
- pysocks 1.7.1.*
- pytest 7.4.4.*
- python 3.11.5.*
- python-build 0.10.0.*
- python-dateutil 2.8.2.*
- python-editor 1.0.4.*
- python-fastjsonschema 2.19.1.*
- python-installer 0.6.0.*
- python-json-logger 2.0.7.*
- python-multipart 0.0.6.*
- python-tzdata 2023.3.*
- python_abi 3.11.*
- pytorch 2.1.2.*
- pytorch-cuda 12.1.*
- pytorch-lightning 2.0.3.*
- pytorch-mutex 1.0.*
- pytz 2023.3.post1.*
- pyyaml 6.0.1.*
- pyzmq 24.0.1.*
- rapidfuzz 2.13.7.*
- re2 2022.04.01.*
- readchar 4.0.5.*
- readline 8.2.*
- referencing 0.30.2.*
- requests 2.31.0.*
- requests-oauthlib 1.3.0.*
- requests-toolbelt 0.9.1.*
- rfc3339-validator 0.1.4.*
- rfc3986-validator 0.1.1.*
- rich 13.3.5.*
- rpds-py 0.10.6.*
- rsa 4.7.2.*
- s3transfer 0.7.0.*
- scikit-bio 0.5.8.*
- scikit-learn 1.2.2.*
- scipy 1.11.4.*
- secretstorage 3.3.1.*
- send2trash 1.8.2.*
- setuptools 68.2.2.*
- shellingham 1.5.0.*
- six 1.16.0.*
- sniffio 1.2.0.*
- soupsieve 2.5.*
- sqlite 3.41.2.*
- stack_data 0.6.2.*
- starlette 0.27.0.*
- starsessions 1.3.0.*
- sympy 1.12.*
- tbb 2021.8.0.*
- tensorboard-data-server 0.7.0.*
- tensorboard-plugin-wit 1.6.0.*
- terminado 0.18.1.*
- threadpoolctl 2.2.0.*
- tinycss2 1.2.1.*
- tk 8.6.12.*
- tomli 2.0.1.*
- tomlkit 0.11.1.*
- torchaudio 2.1.2.*
- torchmetrics 1.1.2.*
- torchtriton 2.1.0.*
- torchvision 0.16.2.*
- tornado 6.3.3.*
- tqdm 4.65.0.*
- traitlets 5.7.1.*
- trove-classifiers 2023.10.18.*
- typing-extensions 4.9.0.*
- typing_extensions 4.9.0.*
- typing_utils 0.1.0.*
- tzdata 2023d.*
- umap-learn 0.5.3.*
- uri-template 1.3.0.*
- urllib3 1.26.18.*
- uvicorn 0.20.0.*
- virtualenv 20.17.1.*
- wcwidth 0.2.5.*
- webcolors 1.13.*
- webencodings 0.5.1.*
- websocket-client 0.58.0.*
- websockets 10.4.*
- werkzeug 2.2.3.*
- wheel 0.41.2.*
- xz 5.4.5.*
- yaml 0.2.5.*
- yarl 1.9.3.*
- zeromq 4.3.5.*
- zipp 3.17.0.*
- zlib 1.2.13.*
- zstd 1.5.5.*