hammer
Hierarchical Augmented Multi-Task Multi-Modal classifiER for fingerprint-based SMILES classifications 🔨
Science Score: 26.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.4%) to scientific vocabulary
Basic Info
Statistics
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 7
- Releases: 0
Metadata Files
README.md
Hammer
Hammer is a Hierarchical Augmented Multi-modal Multi-task classifiER that, given a SMILES string as input, computes a selection of molecular fingerprints and predicts its associated taxonomical ranking.
The classifier can employ a diverse set of molecular fingerprints and descriptors to capture the unique features of the SMILES and enable accurate predictions across multiple tasks.
Furthermore, the model can be beheaded (i.e. its output layers removed) and used either as a feature extractor or as a pre-trained model for transfer learning on other tasks. This package also provides tooling to extract and visualize all of the features used in the model, which can be used to train other models or to perform downstream analyses. If you intend to use this model for transfer learning, take care not to include in your test set any SMILES used to train this model, to avoid biasing your evaluations!
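As a hedged illustration of the kind of input features involved (this is not Hammer's own API, which may differ), one of the fingerprints evaluated below, the extended-connectivity (Morgan) fingerprint, can be computed from a SMILES with RDKit:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Caffeine as an example SMILES input.
smiles = "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"
mol = Chem.MolFromSmiles(smiles)

# 2048-bit extended-connectivity fingerprint with radius 2, analogous to
# the "Extended Connectivity (2r, 2048b)" feature set evaluated below.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
bits = list(fp)  # 0/1 vector usable as an input feature row
```

A matrix of such rows, one per molecule, is the kind of multi-modal input the classifier consumes.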
Installation
This library will eventually be installable via pip; for now, you can install it by cloning the repository and running the following command:
```bash
pip install .
```
Command line interface and usage
While the package can be entirely used as a library, it also provides a command line interface that can be used to perform a variety of tasks and reproduce the experiments that we have conducted or design new ones.
In the following sections, we describe the usage of the command line interface of the Hammer package. These commands are readily available after installing the package; no additional setup is required.
Feature visualization
To visualize the features used in the model using PCA and t-SNE, you can run the following command:
```bash
hammer visualize --verbose --dataset NPC --output-directory "data_visualizations" --image-format "png"
```
This will generate a set of plots showing the distribution of the features used in the model. The plots will be saved in the data_visualizations directory in the png format; you can change the output directory and image format with the --output-directory and --image-format arguments, respectively. The resulting plots will look like the following (this one illustrates the t-SNE and PCA decompositions of the Topological Torsion 1024-bit fingerprint):
It is also possible to visualize specific feature sets, for example the MAP4 features, by using the --include-map4 argument:
```bash
hammer visualize --verbose \
    --dataset NPC \
    --include-map4 \
    --output-directory "data_visualizations" \
    --image-format "png"
```
Similarly, for the GNPS dataset:

```bash
hammer visualize --verbose \
    --dataset GNPS \
    --output-directory "data_visualizations/gnps" \
    --image-format "png"
```
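Under the hood, these visualizations are two-dimensional decompositions of the fingerprint matrix. A minimal sketch with scikit-learn, using a synthetic stand-in for the (molecules × bits) matrix (the actual Hammer pipeline may differ):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for a fingerprint matrix: 50 molecules, 128 bits.
X = rng.integers(0, 2, size=(50, 128)).astype(float)

# Linear decomposition via PCA and non-linear embedding via t-SNE,
# both down to two dimensions for plotting.
pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
```

Each row of `pca_2d` and `tsne_2d` is then a point in the scatter plots, typically colored by pathway or class label.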
DAG Coverage
One of the goals of this project is to increase, over time and with the help of the community, the overall number of pathways, superclasses, and classes that the model can predict. The model employs a DAG as a form of static attention that harmonizes the predictions of the different tasks. At this time, the dataset we are using DOES NOT cover all of the combinations of pathways, superclasses, and classes that the DAG allows for. We aim to increase the coverage of the DAG over time, and we welcome contributions to the dataset that can help us achieve this goal. We are starting out from the dataset made available by NP Classifier.
You can compute a summary of the coverage of the DAG using the following command:
```bash
hammer dag-coverage --dataset NPC --verbose
hammer dag-coverage --dataset NPCHarmonized --verbose
```
At the time of writing, the coverage of the DAG is as follows:
| Dataset       | Layer        | Coverage |
|:--------------|:-------------|---------:|
| NPC           | pathways     | 1        |
| NPC           | superclasses | 0.922078 |
| NPC           | classes      | 0.936782 |
| NPC           | DAG          | 0.730228 |
| NPCHarmonized | pathways     | 1        |
| NPCHarmonized | superclasses | 0.948052 |
| NPCHarmonized | classes      | 0.95546  |
| NPCHarmonized | DAG          | 0.813651 |
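Conceptually, the DAG coverage amounts to counting which of the label combinations allowed by the DAG actually occur in the dataset. A toy sketch (the miniature DAG below is purely illustrative, not the real one):

```python
# Each allowed combination is a (pathway, superclass, class) path in the DAG.
allowed = {
    ("Alkaloids", "Ornithine alkaloids", "Tropane alkaloids"),
    ("Alkaloids", "Ornithine alkaloids", "Pyrrolidine alkaloids"),
    ("Terpenoids", "Monoterpenoids", "Iridoids"),
    ("Terpenoids", "Monoterpenoids", "Menthane monoterpenoids"),
}
# Combinations actually annotated in the dataset.
observed = {
    ("Alkaloids", "Ornithine alkaloids", "Tropane alkaloids"),
    ("Terpenoids", "Monoterpenoids", "Iridoids"),
}
# Coverage: fraction of allowed DAG paths that the dataset exercises.
dag_coverage = len(observed & allowed) / len(allowed)
```

Here two of the four allowed paths are observed, giving a coverage of 0.5; contributions that add molecules for the missing paths raise this value.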
Feature sets evaluation
To evaluate the feature sets used in the model, you can run the following command. This will perform a 10-holdout evaluation of the feature sets: the performance for all holdouts and all considered features will be saved in the feature_sets_evaluation.csv file, while the barplots will be saved in the feature_sets_evaluation_barplots directory.
The dataset is split by first performing a stratified split on the rarest class to set aside the test set, followed by `--holdouts` stratified Monte Carlo splits into sub-training and validation sets. The test set is not touched during this evaluation process, as we will use it to evaluate the model over the selected feature sets.
The model used for these evaluations is the same Hammer model that is used for the predictions, changing only the input feature sets.
```bash
hammer feature-sets-evaluation \
    --verbose \
    --holdouts 10 \
    --dataset NPC \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/feature_sets_evaluation.csv" \
    --training-directory "training/feature_selection" \
    --barplot-directory "barplots/feature_sets_evaluation"
```
Similarly, for the NPCHarmonized dataset:

```bash
hammer feature-sets-evaluation \
    --verbose \
    --holdouts 10 \
    --dataset NPCHarmonized \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/feature_sets_evaluation_harmonized.csv" \
    --training-directory "training/feature_selection_harmonized" \
    --barplot-directory "barplots/feature_sets_evaluation_harmonized"
```
Analogously, for the GNPS dataset:

```bash
hammer holdouts \
    --verbose \
    --holdouts 10 \
    --dataset GNPS \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/feature_sets_evaluation_gnps.csv" \
    --training-directory "training/feature_selection_gnps" \
    --barplot-directory "barplots/feature_sets_evaluation_gnps"
```
Executing this command will generate the barplots you can find in this directory. In the following barplot, you will find the AUPRC for each class, for both validation and test sets, for each feature set, averaged over all holdouts:
In the following table, we illustrate the mean and standard deviation of the validation AUPRC for the different feature sets.
| Feature Set                              | V2 Mean  | V2 Std   | V1 Mean  | V1 Std   | V0 Mean        | V0 Std         |
|:-----------------------------------------|---------:|---------:|---------:|---------:|---------------:|---------------:|
| Atom Pair (2048b)                        | 0.937478 | 0.031032 | 0.927321 | 0.009292 | 0.857813       | 0.002735       |
| Auto-Correlation                         | 0.842348 | 0.039879 | 0.869845 | 0.021085 | 0.811050       | 0.004208       |
| Avalon (2048b)                           | 0.944524 | 0.023624 | 0.919810 | 0.044258 | 0.900124       | 0.003449       |
| Extended Connectivity (2r, 2048b)        | 0.952927 | 0.001753 | 0.935591 | 0.002339 | 0.884863 (r=1) | 0.003651 (r=1) |
| Functional Groups                        | 0.577002 | 0.015732 | 0.572862 | 0.019137 | 0.589513       | 0.015186       |
| Ghose-Crippen                            | 0.642189 | 0.020406 | 0.629073 | 0.021072 | 0.659410       | 0.001983       |
| Laggner                                  | 0.812406 | 0.010464 | 0.805593 | 0.014127 | 0.759160       | 0.012833       |
| Layered (2048b)                          | 0.946703 | 0.002667 | 0.929109 | 0.013407 | 0.898108       | 0.003496       |
| Lingo (1024b)                            | 0.927148 | 0.001787 | 0.915304 | 0.002246 | 0.837630       | 0.002075       |
| MACCS                                    | 0.849523 | 0.016972 | 0.848236 | 0.015830 | 0.810575       | 0.005415       |
| MAP4                                     | 0.945194 | 0.002993 | 0.934615 | 0.002147 | 0.855033       | 0.005111       |
| MinHashed (2r, 2048b)                    | 0.939163 | 0.001409 | 0.928356 | 0.002958 | 0.839570       | 0.009128       |
| Molecular Quantum Numbers                | 0.617027 | 0.024393 | 0.614444 | 0.029922 | 0.672846       | 0.008695       |
| Pattern (2048b)                          | 0.934383 | 0.036935 | 0.892704 | 0.043349 | 0.893990       | 0.005862       |
| PubChem                                  | 0.936316 | 0.005425 | 0.911609 | 0.025582 | 0.885968       | 0.003264       |
| RDKit (2048b)                            | 0.942030 | 0.002838 | 0.927816 | 0.003440 | 0.871136       | 0.008087       |
| SMILES Extended Connectivity (1r, 2048b) | 0.881445 | 0.004091 | 0.870877 | 0.007103 | 0.827976       | 0.004601       |
| Topological Torsion (1024b)              | 0.946040 | 0.002092 | 0.931260 | 0.002015 | 0.863581       | 0.002356       |
| Van Der Waals Surface Area               | 0.769931 | 0.084615 | 0.817110 | 0.063946 | 0.795765       | 0.006069       |
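The means and standard deviations in the table above are aggregated over the per-holdout validation AUPRC values. A minimal sketch of that aggregation, using scikit-learn's average precision as the AUPRC estimate and synthetic labels and scores as stand-ins:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# One AUPRC value per holdout for a single feature set (synthetic data).
auprcs = []
for _ in range(10):
    y_true = rng.integers(0, 2, size=100)  # binary labels for one class
    # Scores correlated with the labels, plus noise.
    y_score = np.clip(y_true * 0.6 + rng.normal(0.0, 0.3, size=100), 0.0, 1.0)
    auprcs.append(average_precision_score(y_true, y_score))

mean_auprc = float(np.mean(auprcs))
std_auprc = float(np.std(auprcs))
```

In the multi-label setting, this is repeated per class and the per-class values are averaged before aggregating across holdouts.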
It is also possible to run the feature-sets-evaluation on a subset of features:
```bash
hammer feature-sets-evaluation \
    --verbose \
    --holdouts 5 \
    --dataset NPC \
    --include-map4 \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/map4_feature_evaluation.csv" \
    --training-directory "training/map4_feature" \
    --barplot-directory "barplots/map4_feature_evaluation"
```
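The splitting scheme described above (a stratified test split, followed by stratified Monte Carlo holdouts into sub-training and validation) can be sketched with scikit-learn. The labels and features below are synthetic stand-ins, and the sketch stratifies on the full label vector rather than specifically on the rarest class:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)   # stand-in class labels
X = rng.normal(size=(200, 16))     # stand-in feature matrix

# 1) Stratified split setting aside the untouched test set (--test-size 0.2).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2) Ten stratified Monte Carlo holdouts (--holdouts 10, --validation-size 0.2).
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
holdouts = list(splitter.split(X_rest, y_rest))
```

Each `(train_idx, valid_idx)` pair in `holdouts` indexes into the non-test portion; the test set is only used once, at the very end, on the selected feature sets.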
Feature sets synergy
After having evaluated the feature sets for a given dataset, the question remains of how the feature sets interact with each other. The performance of the model may not simply be the sum of the performances of the individual feature sets: there may be synergy between them, or extending the input space with redundant features may actually decrease performance by excessively increasing the dimensionality of the input space, making the model more prone to overfitting.
This approach fixes a subset of the feature sets as the base feature sets, then iterates over all of the low-dimensionality (fewer than 1024 features) feature sets, adding them one by one to the base. The performance of the model is evaluated on the validation set and saved in the feature_sets_synergy_training.csv file, while the barplots are saved in the feature_sets_synergy_barplots directory.
We pick the layered feature set as the base, as it is nearly the best-performing feature set and, unlike Avalon, its computation is fully documented: there is no paper describing the Avalon fingerprints.
```bash
hammer feature-sets-synergy \
    --verbose \
    --holdouts 10 \
    --dataset NPC \
    --base-feature-sets "extended_connectivity" \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "performance/synergy/extended_connectivity.csv" \
    --training-directory "trainings/synergy/extended_connectivity" \
    --barplot-directory "barplots/synergy/extended_connectivity"
```
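The greedy forward procedure described above can be sketched in plain Python. The `evaluate` callable and the scores below are hypothetical stand-ins for training the model and measuring its mean validation AUPRC:

```python
def greedy_feature_set_selection(base, candidates, evaluate, min_gain=0.0):
    """Greedily add the candidate feature set that most improves validation
    performance, stopping when no candidate still yields an improvement."""
    selected = list(base)
    best_score = evaluate(selected)
    improved = True
    while improved and candidates:
        improved = False
        gains = {c: evaluate(selected + [c]) for c in candidates}
        best_candidate = max(gains, key=gains.get)
        if gains[best_candidate] > best_score + min_gain:
            selected.append(best_candidate)
            candidates = [c for c in candidates if c != best_candidate]
            best_score = gains[best_candidate]
            improved = True
    return selected, best_score

# Toy scores standing in for the mean validation AUPRC of each combination
# (keys are sorted tuples of feature-set names; values are invented).
scores = {
    ("layered",): 0.900,
    ("layered", "vdw_surface_area"): 0.914,
    ("auto_correlation", "layered"): 0.915,
    ("auto_correlation", "layered", "vdw_surface_area"): 0.912,
}
selected, score = greedy_feature_set_selection(
    ["layered"],
    ["vdw_surface_area", "auto_correlation"],
    lambda sets: scores.get(tuple(sorted(sets)), 0.0),
)
```

Note how the third feature set is only accepted if the full combination beats the best pair: here it does not, so the search stops after two feature sets.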
For the NPC dataset, we have identified that the secondary feature set most synergistic with the base feature set (best validation AUPRC) at the smallest feature size is the Van Der Waals Surface Area, as illustrated in the following barplot:
| Feature Set + Layered      |     Mean |      Std | Feature size |
|:---------------------------|---------:|---------:|-------------:|
| Auto-Correlation           | 0.914921 | 0.002188 |          192 |
| Functional Groups          | 0.905355 | 0.004740 |           85 |
| Ghose-Crippen              | 0.905958 | 0.004123 |          110 |
| Laggner                    | 0.907647 | 0.004016 |          307 |
| MACCS                      | 0.906761 | 0.010752 |          166 |
| Molecular Quantum Numbers  | 0.909889 | 0.003788 |           42 |
| PubChem                    | 0.913653 | 0.003556 |          881 |
| Van Der Waals Surface Area | 0.914309 | 0.004208 |           47 |
We can now proceed to identify the tertiary feature set that is most synergistic with the base and secondary feature sets. We cannot simply pick the next most synergistic secondary feature set, as this would not account for the interaction between the secondary and tertiary feature sets, nor for the increased dimensionality of the input space. Instead, we evaluate the model on the validation set for all combinations of base, secondary, and tertiary feature sets, and select the best-performing one, provided it still improves the performance of the model.
```bash
hammer feature-sets-synergy \
    --verbose \
    --holdouts 10 \
    --dataset NPC \
    --base-feature-sets "layered" "van_der_waals_surface_area" \
    --test-size 0.2 \
    --validation-size 0.2 \
    --performance-path "tertiary_feature_sets_synergy_training.csv" \
    --training-directory "tertiary_feature_sets_synergy_training" \
    --barplot-directory "tertiary_feature_sets_synergy_barplots"
```
Train a model variant
For example, to train the npc.harmonized.v2 model variant on the NPCHarmonized dataset with the selected feature sets:

```bash
hammer train \
    --verbose \
    --dataset NPCHarmonized \
    --include-extended-connectivity \
    --include-layered \
    --include-topological-torsion \
    --include-van-der-waals-surface-area \
    --include-molecular-quantum-numbers \
    --test-size 0.2 \
    --training-directory "npc.harmonized.v2.tar.gz"
```
Predict
You can run predictions for a single SMILES using the following command:
```bash
hammer predict \
    --input "CN1[C@H]2CC[C@@H]1[C@@H](C(OC)=O)[C@@H](OC(C3=CC=CC=C3)=O)C2" \
    --version npc.harmonized.v1
```
which will output:
```text
SMILES: CN1[C@H]2CC[C@@H]1[C@@H](C(OC)=O)[C@@H](OC(C3=CC=CC=C3)=O)C2
Alkaloids (0.9942)
Ornithine alkaloids (0.9988)
Tropane alkaloids (0.9999)
```
Analogously, by running the following command for a multi-class compound:
```bash
hammer predict \
    --input "CCC(C)C1NC(=O)C(Cc2ccccc2)N(C)C(=O)C(C(C)CC)N2C(=O)C(CCC2OC)NC(=O)C(CCCN=C(N)N)NC(=O)C(NC(=O)C(CO)OS(=O)(=O)O)C(C)OC1=O" \
    --version npc.harmonized.v1
```
you will get the following output:
```text
SMILES: CCC(C)C1NC(=O)C(Cc2ccccc2)N(C)C(=O)C(C(C)CC)N2C(=O)C(CCC2OC)NC(=O)C(CCCN=C(N)N)NC(=O)C(NC(=O)C(CO)OS(=O)(=O)O)C(C)OC1=O
Amino acids and Peptides (0.9807)
Oligopeptides (0.9994)
Polyketides (0.9821)
Oligopeptides (0.9994)
Cyclic peptides (0.9999)
Depsipeptides (0.9998)
Ahp-containing cyclodepsipeptides (0.9385)
```
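The DAG mentioned earlier acts as static attention that keeps these per-layer outputs consistent with each other. As a hedged illustration of that idea (not Hammer's actual implementation), one simple post-processing rule is to report a predicted label only when every ancestor along its DAG path also passes the threshold:

```python
# Hypothetical sketch: child -> parent edges of a miniature, assumed DAG.
parents = {
    "Ornithine alkaloids": "Alkaloids",
    "Tropane alkaloids": "Ornithine alkaloids",
    "Oligopeptides": "Amino acids and Peptides",
}
# Invented per-label probabilities, loosely echoing the output above.
probabilities = {
    "Alkaloids": 0.9942,
    "Ornithine alkaloids": 0.9988,
    "Tropane alkaloids": 0.9999,
    "Amino acids and Peptides": 0.10,
    "Oligopeptides": 0.40,
}

def consistent_predictions(probabilities, parents, threshold=0.5):
    """Keep a label only if it and all its DAG ancestors pass the threshold."""
    kept = {}
    for label, p in probabilities.items():
        chain = [label]
        while chain[-1] in parents:  # walk up to the root of the DAG
            chain.append(parents[chain[-1]])
        if all(probabilities.get(l, 0.0) >= threshold for l in chain):
            kept[label] = p
    return kept

kept = consistent_predictions(probabilities, parents)
```

Here the tropane alkaloid path survives intact, while the weakly supported peptide labels are dropped because their pathway fails the threshold.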
You can also run predictions for SMILES from a CSV, TSV or SSV file:
```bash
hammer predict \
    --input "divergent_npc_entries/divergent_pathways.csv" \
    --version npc.harmonized.v1 \
    --verbose \
    --output-dir "divergent_npc_entries/npc.harmonized.v1/"
```
It is even possible to run predictions directly on the SMILES in the metadata of an MGF file. In such a case, you must specify that you want to run the prediction on the MGF file metadata, and not on the spectra themselves, by using the --only-smiles flag:
```bash
hammer predict \
    --input "matchms.mgf" \
    --model-path "npc.harmonized.v2.tar.gz" \
    --verbose \
    --output-dir "matchms_predictions" \
    --only-smiles
```
Citation
If you use this model in your research, please cite us:
[TODO: we still need to properly publish the model, so this is a placeholder and will be updated in the future]
```bibtex
@software{hammer,
    author = {Cappelletti, Luca, et al.},
    title = {Hammer: Hierarchical Augmented Multi-modal Multi-task classifiER},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/LucaCappelletti94/hammer}},
}
```
Contributing
If you want to contribute to this project, please read the CONTRIBUTING file for details on our code of conduct, and the process for submitting pull requests to us.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Owner
- Name: The Earth Metabolome Initiative
- Login: earth-metabolome-initiative
- Kind: organization
- Email: data@earthmetabolome.org
- Repositories: 1
- Profile: https://github.com/earth-metabolome-initiative
Code Repository of the Earth Metabolome Initiative
GitHub Events
Total
- Issues event: 2
- Watch event: 1
- Push event: 1
- Pull request event: 2
- Fork event: 1
Last Year
- Issues event: 2
- Watch event: 1
- Push event: 1
- Pull request event: 2
- Fork event: 1
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 3
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Adafede (2)
Pull Request Authors
- Adafede (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- anytree *
- barplots >=1.2.0
- cache_decorator *
- compress_json *
- compress_pickle *
- downloaders *
- environments_utils >=1.0.14
- extra-keras-metrics >=2.0.12
- keras *
- map4 *
- numpy *
- plot-keras-history *
- pydot *
- rdkit *
- scikit-fingerprints *
- scikit-learn *
- silence_tensorflow *
- typeguard *