medusa
The code for the article "Fully Automated Unconstrained Analysis of High-Resolution Mass Spectrometry Data with Machine Learning"
Repository
Basic Info
- Host: GitHub
- Owner: Ananikov-Lab
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: http://ananikov-lab.github.io/medusa/
- Size: 258 MB
Statistics
- Stars: 22
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
MEDUSA
Machine learning Enabled Deisotoping and Untargeted Spectra Analysis

Mass spectrometry (MS) is a convenient, highly sensitive, and reliable method for the analysis of complex mixtures, which is vital for materials science, life sciences fields such as metabolomics and proteomics, and mechanistic research in chemistry. Although it is one of the most powerful methods for individual compound detection, complete signal assignment in complex mixtures is still a great challenge. The unconstrained formula-generating algorithm, covering the entire spectra and revealing components, is a dream tool for researchers. We present the framework for efficient MS data interpretation, describing a novel approach for detailed analysis based on deisotoping performed by gradient-boosted decision trees and a neural network that generates molecular formulas from the fine isotopic structure, approaching the long-standing inverse spectral problem. The workflow was successfully tested on three examples: fragment ion analysis in protein sequencing for proteomics, analysis of the natural samples for life sciences, and study of the cross-coupling catalytic system for chemistry.
How to use it?
To get started with MEDUSA, first install the required packages. Creating a new virtual environment for this purpose is recommended.
```bash
pip install -r requirements.txt
```
To build the docs, also install Sphinx and the Furo theme:
```bash
pip install sphinx
pip install furo
```
Then build the HTML documentation:
```bash
cd docs
make html
```
The built docs will be in the `_build` directory.
Simple operations with mass-spectra
To open a spectrum, run:
```python
from mass_automation.experiment import Experiment

exp = Experiment('spectrum.mzXML', n_scans=128, n_points=6)
```
Individual spectra can be accessed in a list-like fashion:
```python
spectrum = exp[0]
```
Masses and intensities can be accessed directly as attributes:
```python
masses = spectrum.masses
ints = spectrum.ints
```
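Because the two attributes are parallel sequences, common operations on a spectrum reduce to a few lines. The sketch below is a minimal illustration, assuming `spectrum.masses` and `spectrum.ints` are equal-length sequences of m/z values and intensities. `base_peak` and `select_window` are hypothetical helpers, not part of MEDUSA, and the lists are synthetic stand-ins for a real spectrum.

```python
def base_peak(masses, ints):
    """Return (m/z, intensity) of the most intense point."""
    if len(masses) != len(ints):
        raise ValueError("masses and ints must have the same length")
    idx = max(range(len(ints)), key=lambda i: ints[i])
    return masses[idx], ints[idx]


def select_window(masses, ints, mz_min, mz_max):
    """Keep only the points whose m/z lies in [mz_min, mz_max]."""
    return [(m, i) for m, i in zip(masses, ints) if mz_min <= m <= mz_max]


# Synthetic stand-ins for spectrum.masses and spectrum.ints (assumption).
masses = [100.0, 150.5, 200.2, 250.1]
ints = [10.0, 250.0, 75.0, 5.0]

print(base_peak(masses, ints))
print(select_window(masses, ints, 140.0, 210.0))
```

With a real experiment, the same helpers could be called as `base_peak(spectrum.masses, spectrum.ints)`, provided the attributes behave like sequences.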
See the documentation for more details on working with MEDUSA.
Train element regression/classification models
To prepare the training data, follow these steps:
- Subsample PubChem using the script `research/formula_generation/scripts/subsample_pubchem.py` and generate a list of formulas (RDKit is required).
- Use `research/formula_generation/scripts/honest_subsampling.py` to subsample data from the formula list.
- Generate isotopic distributions using `research/formula_generation/scripts/generate_fake_representation_multiprocess.py` or its single-process version, `research/formula_generation/scripts/generate_fake_representation.py`. These scripts contain the spectra generation parameters used in the current study. The parameters can be altered to match your needs (e.g., a lower-resolution instrument).
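To make the last step more concrete, the following sketch computes a nominal-mass isotopic pattern by repeatedly convolving per-element isotope abundances. It is not the code from the scripts above, which simulate the fine isotopic structure with their own parameters; the `ISOTOPES` table (approximate natural abundances for three elements), `convolve`, and `isotopic_pattern` are illustrative assumptions.

```python
# Approximate natural isotope abundances, indexed by nominal mass offset
# from the lightest isotope (e.g. Cl: 35Cl at +0, 37Cl at +2).
ISOTOPES = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "Cl": [0.7576, 0.0, 0.2424],
}


def convolve(a, b):
    """Discrete convolution of two probability lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def isotopic_pattern(composition, n_peaks=5):
    """Probabilities of the M, M+1, ... peaks for an element-count mapping."""
    pattern = [1.0]
    for element, count in composition.items():
        for _ in range(count):
            # Truncating keeps the retained low-offset peaks exact.
            pattern = convolve(pattern, ISOTOPES[element])[:n_peaks]
    return pattern


pattern = isotopic_pattern({"Cl": 2})
print(pattern)
```

For two chlorine atoms the M+2 peak comes out at roughly 0.64 of the monoisotopic peak, the familiar dichloro signature.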
The representations can then be used to train the models. The data required to reproduce this work is precalculated and provided in the `training_data.tsv.gz` file.
The neural network training code is in the `research/formula_generation/dl` directory. The pretrained model can be loaded as follows:
```python
from mass_automation.formula.model import LSTM

model = LSTM.load_from_checkpoint('pretrained_model.ckpt')
```
See examples of using the models in `research/formula_analysis_examples`.
Train deisotoping models
To train your own model, first create artificial spectra with `research/deisotoping/generate_mixtures.py`.
Then build a training dataset with `research/deisotoping/generate_dataset.py`. See examples of using the models in `research/deisotoping/MLDeisotoper_vs_LinearDeisotoper.ipynb`.
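For readers new to deisotoping, the toy sketch below shows the underlying idea: peaks of one ion are spaced by about 1.00335/z along the m/z axis, so they can be grouped into isotopic clusters. This is a simple greedy illustration under stated assumptions (a known charge and a fixed tolerance in Da), not the ML or linear deisotoper shipped with MEDUSA; `group_isotopic_peaks` is a hypothetical name.

```python
NEUTRON_SPACING = 1.00335  # approximate 13C - 12C mass difference, Da


def group_isotopic_peaks(masses, charge=1, tol=0.005, max_isotopes=5):
    """Greedily group peak indices into isotopic clusters.

    Each unassigned peak, taken in order of increasing m/z, starts a
    cluster that is extended while a peak is found at the next expected
    isotopic position (within `tol` Da).
    """
    order = sorted(range(len(masses)), key=lambda i: masses[i])
    used = set()
    clusters = []
    for i in order:
        if i in used:
            continue
        cluster = [i]
        used.add(i)
        for k in range(1, max_isotopes + 1):
            target = masses[i] + k * NEUTRON_SPACING / charge
            match = next(
                (j for j in order
                 if j not in used and abs(masses[j] - target) <= tol),
                None,
            )
            if match is None:
                break  # the isotopic series ends at the first gap
            cluster.append(match)
            used.add(match)
        clusters.append(cluster)
    return clusters


# Synthetic peak list (assumption): two singly charged ions.
peaks = [300.0000, 301.0034, 302.0066, 450.2000, 451.2033]
print(group_isotopic_peaks(peaks))
```

A greedy rule like this fails when isotopic patterns overlap, which is one motivation for learned deisotoping models.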
Sample-oriented analysis
This analysis differentiates spectra and finds similarities between them by applying unsupervised ML techniques. See examples of using the algorithms in `research/clustering_examples`.
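Unsupervised methods generally need each spectrum expressed as a fixed-length vector. The sketch below shows one common way to get there, binning along the m/z axis and comparing the vectors by cosine similarity. The bin width, the m/z range, and the helper names are assumptions for illustration and may differ from what the MEDUSA examples actually use.

```python
import math


def bin_spectrum(masses, ints, mz_min, mz_max, bin_width):
    """Sum intensities into equal-width m/z bins over [mz_min, mz_max)."""
    n_bins = int(math.ceil((mz_max - mz_min) / bin_width))
    vector = [0.0] * n_bins
    for m, i in zip(masses, ints):
        if mz_min <= m < mz_max:
            idx = min(int((m - mz_min) / bin_width), n_bins - 1)
            vector[idx] += i
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Three synthetic spectra (assumption): a and b are near-replicates.
a = bin_spectrum([100.1, 200.2, 300.3], [1.0, 2.0, 3.0], 50.0, 350.0, 1.0)
b = bin_spectrum([100.12, 200.18, 300.31], [1.1, 1.9, 3.2], 50.0, 350.0, 1.0)
c = bin_spectrum([150.0, 250.0], [5.0, 5.0], 50.0, 350.0, 1.0)

print(cosine_similarity(a, b), cosine_similarity(a, c))
```

The resulting pairwise similarities can be fed to any clustering or dimensionality-reduction routine.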
Compound presence verification
This algorithm finds the isotopic distribution of an ion in a spectrum when its molecular formula is known. See examples of using the algorithm in `research/compound_presence_verification`.
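The general idea can be sketched as matching each theoretical isotopic peak to the observed spectrum within a ppm tolerance and then scoring how well the intensity profiles agree. The code below is a simplified illustration, not the MEDUSA algorithm; the function names, the 5 ppm tolerance, and all numeric values are assumptions.

```python
import math


def match_peaks(theor_masses, masses, ints, ppm=5.0):
    """For each theoretical m/z, take the most intense observed point
    within the ppm tolerance, or 0.0 if nothing is found."""
    matched = []
    for t in theor_masses:
        window = t * ppm * 1e-6
        candidates = [i for m, i in zip(masses, ints) if abs(m - t) <= window]
        matched.append(max(candidates) if candidates else 0.0)
    return matched


def cosine(a, b):
    """Cosine similarity between two intensity profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Illustrative theoretical pattern and observed spectrum (assumptions).
theor_masses = [83.9534, 85.9504, 87.9475]
theor_ints = [1.0, 0.64, 0.10]
obs_masses = [83.9535, 85.9503, 87.9476, 120.0]
obs_ints = [1000.0, 650.0, 98.0, 400.0]

matched = match_peaks(theor_masses, obs_masses, obs_ints)
score = cosine(theor_ints, matched)
print(matched, score)
```

A score close to 1 suggests that the expected isotopic pattern is present, while missing peaks pull the score down.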
Data requirements
Deisotoping, compound presence verification, sample-oriented analysis
These algorithms do not rely on the fine isotopic structure and can therefore be applied to ordinary HRMS spectra. However, the deisotoping models in the data folder were trained with mass measurement errors typical of FT-ICR MS. To make a model suitable for ordinary HRMS spectra, retrain it on synthetic data with larger mass measurement errors.
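One simple way to emulate a larger mass measurement error when building synthetic data is to perturb each m/z value by Gaussian noise expressed in ppm. The sketch below assumes that error model. `add_mass_error` and the 5 ppm standard deviation are hypothetical and are not taken from the MEDUSA generation scripts.

```python
import random


def add_mass_error(masses, ppm_sigma, seed=None):
    """Return a copy of `masses` with Gaussian relative error.

    `ppm_sigma` is the standard deviation of the error in parts per
    million; `seed` makes the perturbation reproducible.
    """
    rng = random.Random(seed)
    return [m * (1.0 + rng.gauss(0.0, ppm_sigma) * 1e-6) for m in masses]


# Synthetic m/z values (assumption), perturbed with a 5 ppm error.
clean = [150.0, 300.0, 600.0]
noisy = add_mass_error(clean, ppm_sigma=5.0, seed=42)

for m, n in zip(clean, noisy):
    print(m, n, (n - m) / m * 1e6)  # error in ppm
```

Because the error is relative, heavier ions receive a proportionally larger absolute shift.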
Element classification and element regression
These algorithms rely heavily on the fine isotopic structure, which is observable with FT-ICR MS and Orbitrap instruments. The models can be retrained to match the specific requirements of any instrument type. If the fine isotopic structure is not observable, some elements can still be recognised, but prediction becomes significantly more difficult.
Current development plan
- [ ] Publish MEDUSA on PyPI
- [ ] Add support for lower-resolution instruments
- [ ] Add support for MS/MS spectra
Where is the data?
All the data used (including the files required to run the tests) is available on MEGA. The tea dataset is currently used in an ongoing research project and can be provided upon request.
How to cite it?
Boiko D.A., Kozlov K.S., Burykina J.V., Ilyushenkova V.V., Ananikov V.P., "Fully Automated Unconstrained Analysis of High-Resolution Mass Spectrometry Data with Machine Learning", J. Am. Chem. Soc., 2022, ASAP https://doi.org/10.1021/jacs.2c03631
Owner
- Name: Laboratory of Metal-Complex and Nanoscale Catalysts
- Login: Ananikov-Lab
- Kind: organization
- Email: ALab@ioc.ac.ru
- Location: Moscow, Russia
- Website: https://ananikovlab.ru
- Twitter: AnanikovLab
- Repositories: 2
- Profile: https://github.com/Ananikov-Lab
Laboratory of Prof. Valentine Ananikov at the Russian Academy of Sciences. Interests: molecular complexity and transformations.
GitHub Events
Total
- Issues event: 1
- Watch event: 9
- Push event: 2
- Fork event: 3
Last Year
- Issues event: 1
- Watch event: 9
- Push event: 2
- Fork event: 3
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- hliu56 (1)
Dependencies
- Jinja2 ==3.1.2
- Markdown ==3.4.1
- MarkupSafe ==2.1.1
- Pillow ==9.2.0
- PyYAML ==6.0
- Pygments ==2.12.0
- Send2Trash ==1.8.0
- Werkzeug ==2.1.2
- absl-py ==1.1.0
- aiohttp ==3.8.1
- aiosignal ==1.2.0
- argon2-cffi ==21.3.0
- argon2-cffi-bindings ==21.2.0
- async-timeout ==4.0.2
- asynctest ==0.13.0
- attrs ==21.4.0
- backcall ==0.2.0
- beautifulsoup4 ==4.11.1
- bleach ==5.0.1
- bokeh ==2.4.3
- cachetools ==5.2.0
- catboost ==1.0.6
- certifi ==2022.6.15
- cffi ==1.15.1
- charset-normalizer ==2.1.0
- chemparse ==0.1.1
- cycler ==0.11.0
- debugpy ==1.6.2
- decorator ==5.1.1
- defusedxml ==0.7.1
- entrypoints ==0.4
- fastjsonschema ==2.16.1
- fonttools ==4.34.4
- frozenlist ==1.3.0
- fsspec ==2022.5.0
- google-auth ==2.9.1
- google-auth-oauthlib ==0.4.6
- grpcio ==1.47.0
- idna ==3.3
- importlib-metadata ==4.12.0
- importlib-resources ==5.8.0
- ipykernel ==6.15.1
- ipython ==7.34.0
- ipython-genutils ==0.2.0
- ipywidgets ==7.7.1
- jedi ==0.18.1
- joblib ==1.1.0
- jsonschema ==4.7.2
- jupyter-client ==7.3.4
- jupyter-core ==4.11.1
- jupyterlab-pygments ==0.2.2
- jupyterlab-widgets ==1.1.1
- kiwisolver ==1.4.4
- matplotlib ==3.5.2
- matplotlib-inline ==0.1.3
- mistune ==0.8.4
- multidict ==6.0.2
- nbclient ==0.6.6
- nbconvert ==6.5.0
- nbformat ==5.4.0
- nest-asyncio ==1.5.5
- notebook ==6.4.12
- numpy ==1.21.6
- oauthlib ==3.2.0
- packaging ==21.3
- pandas ==1.3.5
- pandocfilters ==1.5.0
- parso ==0.8.3
- patsy ==0.5.2
- pexpect ==4.8.0
- pickleshare ==0.7.5
- prettytable ==3.3.0
- prometheus-client ==0.14.1
- prompt-toolkit ==3.0.30
- protobuf ==3.19.4
- psutil ==5.9.1
- ptyprocess ==0.7.0
- pyDeprecate ==0.3.2
- pyasn1 ==0.4.8
- pyasn1-modules ==0.2.8
- pycparser ==2.21
- pyopenms ==2.7.0
- pyparsing ==3.0.9
- pyrsistent ==0.18.1
- pyteomics ==4.5.3
- pytest *
- python-dateutil ==2.8.2
- pytorch-lightning ==1.6.5
- pytz ==2022.1
- pyzmq ==23.2.0
- requests ==2.28.1
- requests-oauthlib ==1.3.1
- rsa ==4.8
- scikit-learn ==1.0.2
- scipy ==1.7.3
- six ==1.16.0
- soupsieve ==2.3.2.post1
- statsmodels ==0.13.2
- tensorboard ==2.9.1
- tensorboard-data-server ==0.6.1
- tensorboard-plugin-wit ==1.8.1
- terminado ==0.15.0
- threadpoolctl ==3.1.0
- tinycss2 ==1.1.1
- torch ==1.12.0
- torchmetrics ==0.9.2
- tornado ==6.2
- tqdm ==4.64.0
- traitlets ==5.3.0
- typing_extensions ==4.3.0
- urllib3 ==1.26.10
- wcwidth ==0.2.5
- webencodings ==0.5.1
- widgetsnbextension ==3.6.1
- xgboost ==1.2.1
- yarl ==1.7.2
- zipp ==3.8.1