medusa

The code for the article "Fully Automated Unconstrained Analysis of High-Resolution Mass Spectrometry Data with Machine Learning"

https://github.com/ananikov-lab/medusa


Repository

Basic Info

Host: GitHub
Owner: Ananikov-Lab
License: mit
Language: Jupyter Notebook
Default Branch: main
Homepage: http://ananikov-lab.github.io/medusa/
Size: 258 MB

Statistics

Stars: 22
Watchers: 0
Forks: 1
Open Issues: 0
Releases: 0

Created almost 4 years ago · Last pushed over 1 year ago

MEDUSA

Machine learning Enabled Deisotoping and Untargeted Spectra Analysis

Mass spectrometry (MS) is a convenient, highly sensitive, and reliable method for the analysis of complex mixtures, which is vital for materials science, life sciences fields such as metabolomics and proteomics, and mechanistic research in chemistry. Although it is one of the most powerful methods for individual compound detection, complete signal assignment in complex mixtures is still a great challenge. The unconstrained formula-generating algorithm, covering the entire spectra and revealing components, is a dream tool for researchers. We present the framework for efficient MS data interpretation, describing a novel approach for detailed analysis based on deisotoping performed by gradient-boosted decision trees and a neural network that generates molecular formulas from the fine isotopic structure, approaching the long-standing inverse spectral problem. The workflow was successfully tested on three examples: fragment ion analysis in protein sequencing for proteomics, analysis of the natural samples for life sciences, and study of the cross-coupling catalytic system for chemistry.

How to use it?

To get started with MEDUSA, first install the required packages. Creating a new virtual environment for this purpose is recommended.

```bash
pip install -r requirements.txt
```

If you want to build the docs, you will also need to install Sphinx and the furo theme:

```bash
pip install sphinx
pip install furo
```

Then you can run

```bash
cd docs
make html
```

The built docs will be in the _build directory.

Simple operations with mass-spectra

To open a spectrum, run:

```python
from mass_automation.experiment import Experiment

exp = Experiment('spectrum.mzXML', n_scans=128, n_points=6)
```

Individual spectra can be accessed in a list-like fashion:

```python
spectrum = exp[0]
```

Masses and intensities can be accessed directly:

```python
masses = spectrum.masses
ints = spectrum.ints
```

See the documentation for more details on working with MEDUSA.
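Once masses and intensities are extracted, ordinary sequence operations apply. The sketch below assumes that `masses` and `ints` behave like equal-length sequences of floats (an assumption, the actual attribute types are described in the documentation) and uses made-up values instead of a real spectrum. The helper names `base_peak` and `peaks_in_window` are hypothetical and are not part of MEDUSA.

```python
def base_peak(masses, ints):
    """Return (m/z, intensity) of the most intense peak."""
    if len(masses) != len(ints):
        raise ValueError("masses and ints must have the same length")
    idx = max(range(len(ints)), key=lambda i: ints[i])
    return masses[idx], ints[idx]


def peaks_in_window(masses, ints, low, high):
    """Return all (m/z, intensity) pairs with low <= m/z <= high."""
    return [(m, i) for m, i in zip(masses, ints) if low <= m <= high]


# Made-up values standing in for spectrum.masses and spectrum.ints
masses = [100.0757, 101.0790, 150.0550, 151.0583]
ints = [1200.0, 80.0, 5400.0, 410.0]

top = base_peak(masses, ints)
window = peaks_in_window(masses, ints, 149.5, 151.5)
print(top, window)
```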

Train element regression/classification models

To prepare the training data, perform the following steps:

Subsample PubChem with the script research/formula_generation/scripts/subsample_pubchem.py and generate a list of formulas (RDKit is required).
Use research/formula_generation/scripts/honest_subsampling.py to subsample data from the formula list.
Generate isotopic distributions with research/formula_generation/scripts/generate_fake_representation_multiprocess.py, or with its single-process version, research/formula_generation/scripts/generate_fake_representation.py. These scripts contain the spectra-generation parameters used in the current study; they may be altered to match your needs (e.g. a lower-resolution instrument).
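To illustrate what "generating an isotopic distribution" means, here is a minimal sketch that convolves per-element isotope abundances for a small formula. It is not the code from the scripts above: it works at nominal (integer) masses, so it does not resolve the fine isotopic structure, and the abundance table holds approximate natural abundances chosen for illustration. The function names are hypothetical.

```python
from collections import defaultdict

# Approximate natural isotope abundances at nominal mass (illustrative values)
ISOTOPES = {
    "C": {12: 0.9893, 13: 0.0107},
    "H": {1: 0.999885, 2: 0.000115},
    "Cl": {35: 0.7576, 37: 0.2424},
}


def convolve(dist_a, dist_b):
    """Combine two mass -> probability distributions of independent parts."""
    out = defaultdict(float)
    for mass_a, prob_a in dist_a.items():
        for mass_b, prob_b in dist_b.items():
            out[mass_a + mass_b] += prob_a * prob_b
    return dict(out)


def isotopic_distribution(formula):
    """Return nominal mass -> probability for a dict such as {"C": 1, "H": 2}."""
    dist = {0: 1.0}
    for element, count in formula.items():
        for _ in range(count):
            dist = convolve(dist, ISOTOPES[element])
    return dist


# Dichloromethane, CH2Cl2
pattern = isotopic_distribution({"C": 1, "H": 2, "Cl": 2})
for mass in sorted(pattern):
    print(mass, round(pattern[mass], 5))
```

A generator aimed at fine isotopic structure would instead track exact isotope masses and model instrument resolution, which is what the parameters in the repository scripts control.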

The representations can then be used to train the models. The data required to reproduce this work is precalculated and provided in the training_data.tsv.gz file.

The neural network training code is presented in the research/formula_generation/dl directory. The pretrained model can be loaded:

```python
from mass_automation.formula.model import LSTM

model = LSTM.load_from_checkpoint('pretrained_model.ckpt')
```

See examples on using the models in research/formula_analysis_examples.

Train deisotoping models

To train your own model, first generate artificial spectra with research/deisotoping/generate_mixtures.py. Then build a training dataset from them with research/deisotoping/generate_dataset.py. See examples of using the models in research/deisotoping/MLDeisotoper_vs_LinearDeisotoper.ipynb.
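For readers new to deisotoping, the sketch below shows the underlying task with a naive baseline: peaks are chained into one isotopic cluster when they are separated by roughly the 13C/12C mass difference divided by the charge. This is neither the ML deisotoper nor the repository's linear deisotoper; the function name, the tolerance, and the peak list are assumptions made for illustration.

```python
C13_SPACING = 1.003355  # Da, approximate mass difference between 13C and 12C


def group_isotopic_peaks(masses, tol=0.005, charge=1):
    """Greedily chain peaks spaced by about C13_SPACING / charge."""
    step = C13_SPACING / charge
    ordered = sorted(masses)
    used = set()
    clusters = []
    for i, start in enumerate(ordered):
        if i in used:
            continue
        used.add(i)
        cluster = [start]
        last = start
        for j in range(i + 1, len(ordered)):
            if j in used:
                continue
            diff = ordered[j] - last
            if abs(diff - step) <= tol:
                cluster.append(ordered[j])
                used.add(j)
                last = ordered[j]
            elif diff > step + tol:
                break  # peaks are sorted, nothing further can match
        clusters.append(cluster)
    return clusters


# Made-up singly charged peak list
peaks = [300.1000, 301.1034, 302.1067, 450.2000, 451.2033, 500.3000]
clusters = group_isotopic_peaks(peaks)
print(clusters)
```

A spacing rule like this ignores intensities and overlapping patterns, which is the kind of case a trained model is meant to handle.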

Sample-oriented analysis

This analysis helps to differentiate spectra and find similarities between them by applying unsupervised ML techniques. See examples of using the algorithms in research/clustering_examples.
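A common first step for comparing spectra is to turn each one into a fixed-length vector and measure pairwise similarity. The sketch below bins peaks on a regular m/z grid and computes cosine similarity. It is a generic illustration, not the algorithm used in research/clustering_examples; the bin width, m/z range, helper names, and spectra are all assumptions.

```python
import math


def bin_spectrum(masses, ints, mz_min, mz_max, bin_width):
    """Sum intensities into equal-width m/z bins covering [mz_min, mz_max)."""
    n_bins = int(math.ceil((mz_max - mz_min) / bin_width))
    vector = [0.0] * n_bins
    for mass, intensity in zip(masses, ints):
        if mz_min <= mass < mz_max:
            idx = min(int((mass - mz_min) / bin_width), n_bins - 1)
            vector[idx] += intensity
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up spectra: a and b are near-replicates, c shares no peaks with them
spec_a = bin_spectrum([100.05, 200.10, 300.15], [10, 50, 20], 50.0, 350.0, 1.0)
spec_b = bin_spectrum([100.06, 200.11, 300.14], [12, 48, 22], 50.0, 350.0, 1.0)
spec_c = bin_spectrum([150.00, 250.00], [30, 30], 50.0, 350.0, 1.0)

sim_ab = cosine_similarity(spec_a, spec_b)
sim_ac = cosine_similarity(spec_a, spec_c)
print(sim_ab, sim_ac)
```

The resulting vectors or similarity matrix can be passed to any clustering or dimensionality-reduction method.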

Compound presence verification

This algorithm finds the isotopic distribution of an ion in a spectrum, given its molecular formula. See examples of using the algorithm in research/compound_presence_verification.
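The core idea, checking whether the peaks expected for a known formula appear in a measured spectrum, can be sketched as a ppm-tolerance match. This is an illustration under stated assumptions rather than MEDUSA's implementation, which may also compare intensities; the function names, the 5 ppm tolerance, and the peak values are hypothetical.

```python
def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_pattern(theoretical_peaks, observed_masses, tol_ppm=5.0):
    """For each theoretical m/z, return the closest observed m/z within tol_ppm, else None."""
    matches = []
    for mz in theoretical_peaks:
        best = min(observed_masses, key=lambda m: abs(m - mz), default=None)
        if best is not None and abs(ppm_error(best, mz)) <= tol_ppm:
            matches.append(best)
        else:
            matches.append(None)
    return matches


# Made-up theoretical pattern and observed peak list
theoretical = [500.0000, 501.0034, 502.0067]
observed = [499.9990, 501.0036, 510.0000]

matches = match_pattern(theoretical, observed)
found = sum(m is not None for m in matches)
print(matches, f"{found}/{len(theoretical)} peaks found")
```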

Data requirements

Deisotoping, compound presence verification, sample-oriented analysis

These algorithms do not rely on the fine isotopic structure and can therefore be applied to ordinary HRMS spectra. However, the deisotoping models in the data folder were trained with mass measurement errors typical of FT-ICR MS. To make a model suitable for ordinary HRMS spectra, you have to train it on synthetic data with larger mass measurement errors.
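One simple way to picture "larger mass measurement errors" in synthetic data is to perturb each m/z value by Gaussian noise whose width is given in ppm. The sketch below does this with the standard library only. It is not the noise model used by the repository's generation scripts, and the sigma value is illustrative rather than an instrument specification.

```python
import random


def add_mass_error(masses, sigma_ppm, seed=None):
    """Perturb each m/z by Gaussian noise with standard deviation sigma_ppm (in ppm)."""
    rng = random.Random(seed)
    return [m * (1.0 + rng.gauss(0.0, sigma_ppm) * 1e-6) for m in masses]


# Made-up exact masses
exact = [200.1000, 500.2500, 900.4000]

# Illustrative noise level only
noisy = add_mass_error(exact, sigma_ppm=5.0, seed=0)

for m_exact, m_noisy in zip(exact, noisy):
    print(m_exact, m_noisy, (m_noisy - m_exact) / m_exact * 1e6)
```

Because the error scales with m/z, heavier ions receive larger absolute shifts for the same ppm value.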

Element classification and element regression

These algorithms rely heavily on the fine isotopic structure, which is observable with FT-ICR MS and Orbitrap instruments. The models can be retrained to match the specific requirements of any type of instrument. If the fine isotopic structure is not observable, some elements can still be recognised, but prediction becomes significantly harder.

Current development plan

[ ] Publish MEDUSA on PyPI
[ ] Add support for lower-resolution instruments
[ ] Add support for MS/MS spectra

Where is the data?

All data used in this work (including the files required to run the tests) is available on MEGA. The tea dataset is currently being used in an ongoing research project and can be provided upon request.

How to cite it?

Boiko D.A., Kozlov K.S., Burykina J.V., Ilyushenkova V.V., Ananikov V.P., "Fully Automated Unconstrained Analysis of High-Resolution Mass Spectrometry Data with Machine Learning", J. Am. Chem. Soc., 2022, ASAP https://doi.org/10.1021/jacs.2c03631

Owner

Name: Laboratory of Metal-Complex and Nanoscale Catalysts
Login: Ananikov-Lab
Kind: organization
Email: ALab@ioc.ac.ru
Location: Moscow, Russia

Website: https://ananikovlab.ru
Twitter: AnanikovLab
Repositories: 2
Profile: https://github.com/Ananikov-Lab

Laboratory of Prof. Valentine Ananikov at the Russian Academy of Sciences. Interests: molecular complexity and transformations.

GitHub Events

Total

Issues event: 1
Watch event: 9
Push event: 2
Fork event: 3

Last Year

Issues event: 1
Watch event: 9
Push event: 2
Fork event: 3

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

hliu56 (1)

Dependencies

requirements.txt pypi

Jinja2 ==3.1.2
Markdown ==3.4.1
MarkupSafe ==2.1.1
Pillow ==9.2.0
PyYAML ==6.0
Pygments ==2.12.0
Send2Trash ==1.8.0
Werkzeug ==2.1.2
absl-py ==1.1.0
aiohttp ==3.8.1
aiosignal ==1.2.0
argon2-cffi ==21.3.0
argon2-cffi-bindings ==21.2.0
async-timeout ==4.0.2
asynctest ==0.13.0
attrs ==21.4.0
backcall ==0.2.0
beautifulsoup4 ==4.11.1
bleach ==5.0.1
bokeh ==2.4.3
cachetools ==5.2.0
catboost ==1.0.6
certifi ==2022.6.15
cffi ==1.15.1
charset-normalizer ==2.1.0
chemparse ==0.1.1
cycler ==0.11.0
debugpy ==1.6.2
decorator ==5.1.1
defusedxml ==0.7.1
entrypoints ==0.4
fastjsonschema ==2.16.1
fonttools ==4.34.4
frozenlist ==1.3.0
fsspec ==2022.5.0
google-auth ==2.9.1
google-auth-oauthlib ==0.4.6
grpcio ==1.47.0
idna ==3.3
importlib-metadata ==4.12.0
importlib-resources ==5.8.0
ipykernel ==6.15.1
ipython ==7.34.0
ipython-genutils ==0.2.0
ipywidgets ==7.7.1
jedi ==0.18.1
joblib ==1.1.0
jsonschema ==4.7.2
jupyter-client ==7.3.4
jupyter-core ==4.11.1
jupyterlab-pygments ==0.2.2
jupyterlab-widgets ==1.1.1
kiwisolver ==1.4.4
matplotlib ==3.5.2
matplotlib-inline ==0.1.3
mistune ==0.8.4
multidict ==6.0.2
nbclient ==0.6.6
nbconvert ==6.5.0
nbformat ==5.4.0
nest-asyncio ==1.5.5
notebook ==6.4.12
numpy ==1.21.6
oauthlib ==3.2.0
packaging ==21.3
pandas ==1.3.5
pandocfilters ==1.5.0
parso ==0.8.3
patsy ==0.5.2
pexpect ==4.8.0
pickleshare ==0.7.5
prettytable ==3.3.0
prometheus-client ==0.14.1
prompt-toolkit ==3.0.30
protobuf ==3.19.4
psutil ==5.9.1
ptyprocess ==0.7.0
pyDeprecate ==0.3.2
pyasn1 ==0.4.8
pyasn1-modules ==0.2.8
pycparser ==2.21
pyopenms ==2.7.0
pyparsing ==3.0.9
pyrsistent ==0.18.1
pyteomics ==4.5.3
pytest *
python-dateutil ==2.8.2
pytorch-lightning ==1.6.5
pytz ==2022.1
pyzmq ==23.2.0
requests ==2.28.1
requests-oauthlib ==1.3.1
rsa ==4.8
scikit-learn ==1.0.2
scipy ==1.7.3
six ==1.16.0
soupsieve ==2.3.2.post1
statsmodels ==0.13.2
tensorboard ==2.9.1
tensorboard-data-server ==0.6.1
tensorboard-plugin-wit ==1.8.1
terminado ==0.15.0
threadpoolctl ==3.1.0
tinycss2 ==1.1.1
torch ==1.12.0
torchmetrics ==0.9.2
tornado ==6.2
tqdm ==4.64.0
traitlets ==5.3.0
typing_extensions ==4.3.0
urllib3 ==1.26.10
wcwidth ==0.2.5
webencodings ==0.5.1
widgetsnbextension ==3.6.1
xgboost ==1.2.1
yarl ==1.7.2
zipp ==3.8.1