ms2query

MS2Query - machine learning assisted library querying of MS/MS spectra

https://github.com/iomega/ms2query

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.7%) to scientific vocabulary

Keywords from Contributors

fuzzy-matching fuzzy-search mass-spectrometry metabolomics word2vec
Last synced: 6 months ago

Repository

MS2Query - machine learning assisted library querying of MS/MS spectra

Basic Info
  • Host: GitHub
  • Owner: iomega
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 80.5 MB
Statistics
  • Stars: 47
  • Watchers: 6
  • Forks: 12
  • Open Issues: 20
  • Releases: 51
Created over 5 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md


MS2Query - Reliable and fast MS/MS spectral-based analogue search

Overview

The publication can be found here: https://rdcu.be/c8Hkc. Please cite this article when using MS2Query.

MS2Query uses MS2 mass spectral data to find the best match in a library and is able to search for both analogues and exact matches. A pretrained library for MS2Query is available based on the GNPS library. In our benchmarking we show that MS2Query performs better than current standards in the field, such as the cosine score and the modified cosine score. MS2Query is easy to install (see below) and is scalable to large numbers of MS2 spectra.

Workflow

MS2Query is a tool for MS/MS library matching, searching both for analogues and exact matches in one run. The workflow first uses MS2Deepscore to calculate spectral similarity scores between all library spectra and a query spectrum. By using pre-computed MS2Deepscore embeddings for library spectra, this full-library comparison can be computed very quickly. The top 2000 spectra with the highest MS2Deepscore are selected. In contrast to other analogue search methods, no preselection on precursor m/z is performed. MS2Query then re-ranks these candidates with a random forest that combines 5 features, so that the best analogue or exact match ends up at the top. The random forest predicts a score between 0 and 1 for each pair of library and query spectrum, and the highest scoring library match is selected. By using a minimum threshold for this score, unreliable matches are filtered out.
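The three stages described above (full-library preselection, re-ranking, thresholding) can be sketched roughly as follows. This is an illustration only and not MS2Query's implementation: plain cosine similarity between embedding vectors stands in for MS2Deepscore, the `rerank_score` callable is a hypothetical stand-in for the random forest, and the function names and the default threshold of 0.7 are assumptions chosen for the example.

```python
import math
from typing import Callable, Dict, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_library_match(
    query: Vector,
    library: Dict[str, Vector],
    rerank_score: Callable[[str, float], float],  # hypothetical stand-in for the random forest
    top_n: int = 2000,
    min_score: float = 0.7,  # assumed threshold, chosen for illustration
) -> Optional[Tuple[str, float]]:
    """Preselect by embedding similarity, re-rank the candidates, then apply a threshold."""
    # Stage 1: compare the query against every library embedding (no precursor m/z filter).
    similarities = {name: cosine_similarity(query, emb) for name, emb in library.items()}
    candidates = sorted(similarities, key=similarities.get, reverse=True)[:top_n]
    if not candidates:
        return None
    # Stage 2: score each preselected candidate between 0 and 1 and keep the best one.
    scored = [(name, rerank_score(name, similarities[name])) for name in candidates]
    best = max(scored, key=lambda item: item[1])
    # Stage 3: report no match if the best score is below the threshold.
    return best if best[1] >= min_score else None


library = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
match = best_library_match([1.0, 0.1], library, rerank_score=lambda name, sim: sim, top_n=2)
print(match)
```

Because the preselection keeps a fixed number of candidates rather than filtering on precursor m/z, an analogue with a different precursor mass can still reach the re-ranking stage.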

For questions regarding MS2Query, please open an issue on GitHub or contact niek.dejonge@wur.nl

Installation guide

Prepare environment

We recommend creating an Anaconda environment with

conda create --name ms2query python=3.9
conda activate ms2query

Pip install MS2Query

MS2Query can be installed by running:

pip install ms2query

All dependencies are installed automatically; they are listed in setup.py. The installation is expected to take about 2 minutes. MS2Query is tested by continuous integration on MacOS, Windows and Ubuntu for Python versions 3.9 and 3.10.

Run MS2Query from command line

Download default library

When running for the first time, a pretrained MS2Query library should be downloaded. Change the file location to the folder where the library should be stored, and set --ionmode to the needed ion mode (positive or negative).

ms2query --library .\folder_to_store_the_library --download --ionmode positive

Alternatively all model files can be manually downloaded from https://zenodo.org/record/6124552 for positive mode and https://zenodo.org/record/7104184 for negative mode.

Preprocessing mass spectra

MS2Query is run on all MS2 spectra in a spectrum file. MS2Query does not do any peak picking or clustering of similar MS2 spectra. If your files contain many MS2 spectra per feature it is advised to first reduce the number of MS2 spectra by clustering or feature selection. There are multiple tools available that do this. One reliable method is using MZMine for preprocessing, https://mzmine.github.io/mzmine_documentation/index.html. As input for MS2Query you can use the MGF file of the FBMN output of MZMine, see https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking-with-mzmine2/.
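As a rough illustration of what reducing to one spectrum per feature can mean, the sketch below keeps, for each feature, the spectrum with the highest summed intensity. This is a simplified stand-in and not a replacement for MZmine or proper clustering. The dictionary layout, the `feature_id` and `intensities` keys and the function name are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, List


def keep_most_intense_per_feature(
    spectra: Iterable[Dict[str, object]],
    feature_key: str = "feature_id",  # hypothetical metadata key
) -> List[Dict[str, object]]:
    """Keep one spectrum per feature: the one with the highest summed peak intensity."""
    best = {}
    for spectrum in spectra:
        feature = spectrum[feature_key]
        total_intensity = sum(spectrum["intensities"])
        # Replace the stored spectrum only if this one is strictly more intense.
        if feature not in best or total_intensity > best[feature][0]:
            best[feature] = (total_intensity, spectrum)
    return [spectrum for _, spectrum in best.values()]


spectra = [
    {"feature_id": 1, "intensities": [10.0, 5.0]},
    {"feature_id": 1, "intensities": [50.0, 20.0]},
    {"feature_id": 2, "intensities": [3.0]},
]
print(len(keep_most_intense_per_feature(spectra)))
```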

Running MS2Query

After downloading a default library, MS2Query can be run on your MS2 spectra. Run the command below and specify the location where your spectra are stored. If a spectrum file is specified, all spectra in this file will be processed. If a folder is specified, all spectrum files within this folder will be processed. The results generated by MS2Query are stored as csv files in a results directory within the same directory as your query spectra.

ms2query --spectra .\location_of_spectra --library .\library_folder --ionmode positive

To do a test run with dummy data you can download the file dummy_spectra.mgf. The expected results can be found in expected_results_dummy_data.csv. After downloading the library files, running on the dummy data is expected to take less than half a minute.

Run ms2query --help for more info/options, or see below:

```console
usage: MS2Query [-h] [--spectra SPECTRA] --library LIBRARY_FOLDER
                [--ionmode {positive,negative}] [--download]
                [--results RESULTS] [--filter_ionmode]

MS2Query is a tool for MSMS library matching, searching both for analogues and
exact matches in one run

optional arguments:
  -h, --help            show this help message and exit
  --spectra SPECTRA     The MS2 query spectra that should be processed. If a
                        directory is specified all spectrum files in the
                        directory will be processed. Accepted formats are:
                        "mzML", "json", "mgf", "msp", "mzxml", "usi" or a
                        pickled matchms object
  --library LIBRARY_FOLDER
                        The directory containing the library spectra (in
                        sqlite), models and precalculated embeddings, to
                        download add --download
  --ionmode {positive,negative}
                        Specify the ionization mode used
  --download            This will download the most up to date model and
                        library. The model will be stored in the folder given
                        as the second argument. The model will be downloaded
                        in the ionization mode specified under --ionmode
  --results RESULTS     The folder in which the results should be stored. The
                        default is a new results folder in the folder with the
                        spectra
  --filter_ionmode      Filter out all spectra that are not in the specified
                        ion-mode. The ion mode can be specified by using
                        --ionmode
  --additional_metadata
                        Return additional metadata columns in the results, for
                        example --additional_metadata retention_time feature_id
```
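When MS2Query is called from a pipeline, the options listed above can be assembled into an argument list before handing it to `subprocess.run`. The helper below is a minimal sketch: the function name `build_ms2query_command` is hypothetical, and only flags that appear in the help text above are used.

```python
from typing import List, Optional


def build_ms2query_command(
    spectra: str,
    library: str,
    ionmode: str = "positive",
    results: Optional[str] = None,
    filter_ionmode: bool = False,
) -> List[str]:
    """Build the argument list for an ms2query command line run."""
    if ionmode not in ("positive", "negative"):
        raise ValueError(f"ionmode must be 'positive' or 'negative', got {ionmode!r}")
    command = ["ms2query", "--spectra", spectra, "--library", library, "--ionmode", ionmode]
    if results is not None:
        command += ["--results", results]
    if filter_ionmode:
        command.append("--filter_ionmode")
    return command


command = build_ms2query_command("./location_of_spectra", "./library_folder", filter_ionmode=True)
print(" ".join(command))
# To execute it (requires ms2query to be installed and a downloaded library):
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.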

Interpretation of results

The output is a csv file; an example can be found in expected_results_dummy_data.csv. For each of your input spectra MS2Query predicts a library match. It is important to check the ms2query_model_prediction column. This column contains a score between 0 and 1 that indicates how likely it is that the match found is a good one: the closer the score is to 1, the more likely it is a good match or analogue. Use this score to select only the reliable hits, since a prediction is returned for every spectrum, regardless of its score. There is no strict minimum for this score, and a good threshold depends on your research goal. If high recall is important you might want a low threshold, and if high reliability is more important you might want a high threshold. As a general indication, scores > 0.7 include many good analogues and exact matches. In the range 0.6-0.7 the results can still be useful but should be analysed with more caution, and results below 0.6 can often best be discarded.
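Selecting reliable hits from a results file can be sketched as below. The example assumes the score column is named `ms2query_model_prediction` and uses a threshold of 0.7 as discussed above; the `query_spectrum_nr` column and the inline example data are hypothetical and only serve to make the snippet self-contained.

```python
import csv
import io
from typing import Dict, Iterable, List


def select_reliable_hits(
    rows: Iterable[Dict[str, str]],
    threshold: float = 0.7,
    score_column: str = "ms2query_model_prediction",
) -> List[Dict[str, str]]:
    """Keep only the result rows whose model prediction exceeds the threshold."""
    return [row for row in rows if float(row[score_column]) > threshold]


# Hypothetical results; in practice, open the csv file written by MS2Query instead.
example_csv = io.StringIO(
    "query_spectrum_nr,ms2query_model_prediction\n"
    "1,0.91\n"
    "2,0.65\n"
    "3,0.42\n"
)
reliable = select_reliable_hits(csv.DictReader(example_csv))
print([row["query_spectrum_nr"] for row in reliable])
```

Lowering `threshold` trades reliability for recall, so it is worth choosing the value with the research question in mind.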

MS2Query does not need two different workflows for searching for analogues and for exact matches; it automatically selects the most likely library spectra. If it is important for your research question to separate potential exact matches from potential analogues, the column with the precursor m/z difference can be used, since exact matches should have no precursor m/z difference. The rightmost columns contain estimated molecular classes based on the molecular structure of the predicted library molecule; these columns can be used to get a quick overview of the kinds of compounds that were found.
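Separating potential exact matches from potential analogues could look like the sketch below. The column name `precursor_mz_difference` and the tolerance of 0.01 are assumptions made for this example; check the header of your own results file and pick a tolerance that suits your instrument.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def split_exact_and_analogues(
    rows: Iterable[Row],
    difference_column: str = "precursor_mz_difference",  # assumed column name
    tolerance: float = 0.01,  # assumed tolerance for "no difference"
) -> Tuple[List[Row], List[Row]]:
    """Split result rows into potential exact matches and potential analogues."""
    exact_matches, analogues = [], []
    for row in rows:
        if abs(float(row[difference_column])) <= tolerance:
            exact_matches.append(row)
        else:
            analogues.append(row)
    return exact_matches, analogues


rows = [
    {"query_spectrum_nr": "1", "precursor_mz_difference": "0.001"},
    {"query_spectrum_nr": "2", "precursor_mz_difference": "14.016"},
]
exact_matches, analogues = split_exact_and_analogues(rows)
print(len(exact_matches), len(analogues))
```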

Build MS2Query into other tools

If you want to incorporate MS2Query into another tool it might be easier to run MS2Query from a python script, instead of running from the command line. The guide below can be used as a starting point.

Below you can find an example script for running MS2Query. Before running the script, replace the variables ms2query_library_files_directory and ms2_spectra_directory with the correct directories.

This script will first download the files for a default MS2Query library. This default library is trained on the GNPS library from 15 December 2021.

After downloading, a library search and an analogue search are performed on the query spectra in your directory (ms2_spectra_directory). The results generated by MS2Query are stored as csv files in a results directory within the same directory as your query spectra.

```python
from ms2query.run_ms2query import download_zenodo_files, run_complete_folder
from ms2query.ms2library import create_library_object_from_one_dir

# Set the location where downloaded library and model files are stored
ms2query_library_files_directory = "./ms2query_library_files"

# Define the folder in which your query spectra are stored.
# Accepted formats are: "mzML", "json", "mgf", "msp", "mzxml", "usi" or a pickled matchms object.
ms2_spectra_directory = ...  # Fill in the folder containing your query spectra
ion_mode = ...  # Fill in "positive" or "negative" to indicate for which ion mode you would like to download the library

# Downloads pretrained models and files for MS2Query (>2GB download)
download_zenodo_files(ion_mode, ms2query_library_files_directory)

# Create a MS2Library object
ms2library = create_library_object_from_one_dir(ms2query_library_files_directory)

# Run library search and analog search on your files.
run_complete_folder(ms2library, ms2_spectra_directory)
```

Create your own library (without training new models)

The code below creates all required library files for your own in-house library. No new models for MS2Deepscore, Spec2Vec and MS2Query will be trained; to do this, see the next section.

First install MS2Query (see above under Installation guide). To create your own library you also need to install RDKit, by running the following in your command line (while in the ms2query conda environment):

conda install -c conda-forge rdkit

It is important that the library spectra are annotated with SMILES, InChIs or InChIKeys in the metadata; otherwise they are not included in the library.
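Before building a library it can be useful to count how many spectra would be dropped for lack of a structure annotation. The sketch below works on plain metadata dictionaries and does not use MS2Query's own cleaning code. The lowercase key names `smiles`, `inchi` and `inchikey` and the function names are assumptions for this example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

ANNOTATION_KEYS = ("smiles", "inchi", "inchikey")  # assumed metadata key names

Metadata = Dict[str, Optional[str]]


def has_structure_annotation(metadata: Metadata) -> bool:
    """Return True if at least one structure annotation field is non-empty."""
    return any((metadata.get(key) or "").strip() for key in ANNOTATION_KEYS)


def split_by_annotation(metadata_list: Iterable[Metadata]) -> Tuple[List[Metadata], List[Metadata]]:
    """Split spectrum metadata into annotated and unannotated entries."""
    annotated, unannotated = [], []
    for metadata in metadata_list:
        (annotated if has_structure_annotation(metadata) else unannotated).append(metadata)
    return annotated, unannotated


library_metadata = [
    {"compound_name": "caffeine", "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"},
    {"compound_name": "unknown", "smiles": "", "inchikey": None},
]
annotated, unannotated = split_by_annotation(library_metadata)
print(len(annotated), len(unannotated))
```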

Fill in the blank spots with the file locations. The models for spec2vec, ms2deepscore and ms2query can be downloaded from the zenodo links (see above).

```python
import os
from ms2query.create_new_library.library_files_creator import LibraryFilesCreator
from ms2query.clean_and_filter_spectra import clean_normalize_and_split_annotated_spectra
from ms2query.utils import load_matchms_spectrum_objects_from_file, select_files_in_directory
from ms2query.run_ms2query import download_zenodo_files
from ms2query.ms2library import select_files_for_ms2query

# The file location of your library spectra
spectrum_file_location = "../tests/test_files/general_test_files/100_test_spectra.mgf"

# The ionisation mode, choose between "positive" or "negative"
ionisation_mode = "positive"

# Specify the directory in which the models will be downloaded and the library will be stored.
directory_for_library_and_models = "./ms2query_library"

# Downloads the models:
download_zenodo_files(ionisation_mode, directory_for_library_and_models, only_models=True)
library_spectra = load_matchms_spectrum_objects_from_file(spectrum_file_location)

files_in_directory = select_files_in_directory(directory_for_library_and_models)
dict_with_file_names = select_files_for_ms2query(files_in_directory, ["s2v_model", "ms2ds_model"])
ms2ds_model_file_name = os.path.join(directory_for_library_and_models, dict_with_file_names["ms2ds_model"])
s2v_model_file_name = os.path.join(directory_for_library_and_models, dict_with_file_names["s2v_model"])

# Fill in the missing values:
cleaned_library_spectra = clean_normalize_and_split_annotated_spectra(library_spectra, ion_mode_to_keep=ionisation_mode)[0]

library_creator = LibraryFilesCreator(cleaned_library_spectra,
                                      output_directory=directory_for_library_and_models,
                                      ms2ds_model_file_name=ms2ds_model_file_name,
                                      s2v_model_file_name=s2v_model_file_name,
                                      )
library_creator.create_all_library_files()
```

To run MS2Query on your own created library, check out the instructions under Run MS2Query. Both the command line and the code version should work. Make sure that the downloaded models and the SQLite file, S2V embeddings file and ms2ds embeddings file you just generated are in the same directory. The results will be returned as csv files in a results directory.

An alternative way of loading an ms2library is to specify each file type manually, which is needed if not all files are in one directory or if the library files or models have unexpected names.

```python
from ms2query.ms2library import MS2Library
from ms2query.run_ms2query import run_complete_folder

# Replace every ... placeholder with your own file location before running.
ms2_spectra_directory = ...  # Fill in the location of your query spectra

# Specify all the file locations
ms2library = MS2Library(sqlite_file_name=...,
                        s2v_model_file_name=...,
                        ms2ds_model_file_name=...,
                        pickled_s2v_embeddings_file_name=...,
                        pickled_ms2ds_embeddings_file_name=...,
                        ms2query_model_file_name=...,
                        classifier_csv_file_name=...,  # Leave None if not available
                        )
run_complete_folder(ms2library, ms2_spectra_directory)
```

Create your own library and train new models

The code below trains new MS2Deepscore, Spec2Vec and MS2Query models for your in-house library, and creates all files needed for running MS2Query.

It is important that the library spectra are annotated with SMILES, InChIs or InChIKeys in the metadata; otherwise they are not included in the library and training.

Fill in the blank spots below and run the code (this can take several days). The models will be stored in the specified output_folder. After training, MS2Query can be run on the resulting library and models.

from ms2query.create_new_library.train_models import clean_and_train_models

clean_and_train_models(
    spectrum_file=...,  # Fill in the location of the file containing the library spectra
    # Accepted formats are: "mzML", "json", "mgf", "msp", "mzxml", "usi" or a pickled matchms object.
    ion_mode=...,  # Fill in the ion mode, choose from "positive" or "negative"
    output_folder=...,  # The output folder in which all the models are stored.
)

To run MS2Query on your own created library run the code below (again fill in the blanks).

```python
from ms2query.run_ms2query import run_complete_folder
from ms2query.ms2library import create_library_object_from_one_dir

# Define the folder in which your query spectra are stored.
# Accepted formats are: "mzML", "json", "mgf", "msp", "mzxml", "usi" or a pickled matchms object.
ms2_spectra_directory = ...  # Specify the folder containing the query spectra you want to run against the library
ms2_library_directory = ...  # Specify the directory containing all the library and model files

# Create a MS2Library object from one directory
# If this does not work (because files have unexpected names or are not in one dir),
# specify each file manually with MS2Library as described earlier.
ms2library = create_library_object_from_one_dir(ms2_library_directory)

# Run library search and analog search on your files.
run_complete_folder(ms2library, ms2_spectra_directory)
```

After training has finished, you can run MS2Query on your newly created models and library. See above on how to run MS2Query.

Documentation for developers

Prepare environment

We recommend creating an Anaconda environment with

conda create --name ms2query python=3.9
conda activate ms2query

Clone repository

Clone the present repository, e.g. by running

git clone https://github.com/iomega/ms2query.git

Then install the required dependencies, e.g. by running the following from within the cloned directory:

pip install -e .

To check that everything was installed successfully, run all unit tests with:

pytest

Recreate Results Manuscript

To recreate the results in Figure 2 of the MS2Query manuscript, download the intermediate results from https://zenodo.org/record/7427094. Install MS2Query as described above (no need to download the models) and run the code below. This code should work with version 0.5.7.

```python
from ms2query.benchmarking.create_accuracy_vs_recall_plot import create_plot

# The folder where the benchmarking results are stored. These can be downloaded from https://zenodo.org/record/7427094
base_folder = "./benchmarking_results"

create_plot(exact_matches=True,  # Change to switch between the plot for the exact matches test set and the analogues test set
            positive=False,  # Change to switch between the positive and negative ionization mode results
            recalculate_means=True,
            save_figure=False,  # If you want to save the figure, change to True
            base_folder=base_folder
            )
```

The code above only recreates the figures from the already generated test results. To reproduce the test results from scratch, the models have to be retrained. The test split is random and model training has a random component, so the results could vary slightly, but the general conclusions are expected to be the same. To use the same starting data as used for the 20-fold cross-validation, download the ALL_GNPS_NO_PROPOGATED.mgf file from https://zenodo.org/record/7427094. This set was downloaded from https://gnps-external.ucsd.edu/gnpslibrary on 01-11-2022; alternatively, a more recent version could be downloaded. To redo the analysis with exactly the same test splits as in the manuscript, the test sets can be downloaded from https://zenodo.org/record/7427094. The training data for each of the 20 data splits can be constructed by combining the 19 other test sets.
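The way training data is derived from the downloaded test sets can be sketched as follows: for each split, the held-out test set is left out and the remaining ones are concatenated. This is a generic k-fold illustration on plain lists, not the code used in the benchmarking module, and the function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def training_sets_from_test_sets(test_sets: Sequence[Sequence[T]]) -> List[List[T]]:
    """For each split, combine all other test sets into that split's training set."""
    training_sets = []
    for held_out in range(len(test_sets)):
        training = [
            spectrum
            for index, test_set in enumerate(test_sets)
            if index != held_out
            for spectrum in test_set
        ]
        training_sets.append(training)
    return training_sets


# 20 tiny placeholder test sets with two "spectra" each
test_sets = [[f"spectrum_{i}_a", f"spectrum_{i}_b"] for i in range(20)]
training_sets = training_sets_from_test_sets(test_sets)
print(len(training_sets), len(training_sets[0]))
```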

If you want to randomly recreate the test splits from scratch run:

```python
from ms2query.benchmarking.k_fold_cross_validation import (
    split_and_store_annotated_unannotated,
    split_k_fold_cross_validation_analogue_test_set,
    split_k_fold_cross_validation_exact_match_test_set,
)
from ms2query.utils import load_matchms_spectrum_objects_from_file

spectrum_file_name = "./ALL_GNPS_NO_PROPOGATED.mgf"
split_and_store_annotated_unannotated(spectrum_file_name, ion_mode="positive", output_folder="./positive_mode_data_split")
split_and_store_annotated_unannotated(spectrum_file_name, ion_mode="negative", output_folder="./negative_mode_data_split")

positive_annotated_spectra = load_matchms_spectrum_objects_from_file("positive_mode_data_split/annotated_training_spectra.pickle")
negative_annotated_spectra = load_matchms_spectrum_objects_from_file("negative_mode_data_split/annotated_training_spectra.pickle")

# Run for positive mode spectra
split_k_fold_cross_validation_analogue_test_set(positive_annotated_spectra, 20, output_folder="./positive_mode/analogue_test_sets_splits/")
split_k_fold_cross_validation_exact_match_test_set(positive_annotated_spectra, 20, output_folder="./positive_mode/exact_matches_test_sets_splits/")

# Run for negative mode spectra
split_k_fold_cross_validation_analogue_test_set(negative_annotated_spectra, 20, output_folder="./negative_mode/analogue_test_sets_splits/")
split_k_fold_cross_validation_exact_match_test_set(negative_annotated_spectra, 20, output_folder="./negative_mode/exact_matches_test_sets_splits/")

# The 20 different data splits will be stored in the specified folders
```

To train the models and create the test results for MS2Query and all benchmarking methods (cosine, modified cosine and MS2Deepscore), run the script below once for each test split, so it should be started 20 times for each type of test split. The running time for this script is a few days, since it trains all models and creates all test results.

from ms2query.benchmarking.k_fold_cross_validation import train_models_and_test_result_from_k_fold_folder

k_fold_split_number = 0  # Vary this number between 0 and 19
train_models_and_test_result_from_k_fold_folder(
    "./benchmarking_test_sets/exact_matches_test_sets_splits/",
    k_fold_split_number,
    exact_matches=True)  # Change for the analogue test set; this changes the precursor m/z prefiltering to match exact matches or analogue search for the reference benchmarking methods.

After creating all the results, run create_plot (the first block of Python code) to create the new plots.

Contributing

If you want to contribute to the development of ms2query, have a look at the contribution guidelines.

License

Copyright (c) 2021, Netherlands eScience Center

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner

  • Name: Integrated Omics for MEtabolomics and Genomics Annotation
  • Login: iomega
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: MS2Query
message: >-
  If you use this software, please cite it using these
  metadata.
type: software
authors:
  - affiliation: Wageningen University and Research
    family-names: Jonge
    name-particle: de
    given-names: Niek F.
    orcid: 'https://orcid.org/0000-0002-3054-6210'
  - affiliation: Wageningen University and Research
    family-names: Louwen
    given-names: Joris J. R.
    orcid: 'https://orcid.org/0000-0003-4887-9109'
  - affiliation: Netherlands eScience Center
    family-names: Huber
    given-names: Florian
    orcid: 'https://orcid.org/0000-0002-3535-9406'
  - affiliation: Wageningen University and Research
    family-names: Hooft
    name-particle: van der
    given-names: Justin J. J.
    orcid: 'https://orcid.org/0000-0002-9340-5511'
repository-code: 'https://github.com/iomega/ms2query'
abstract: >-
  Machine learning assisted library querying of MS/MS
  spectra.
license: Apache-2.0

GitHub Events

Total
  • Create event: 2
  • Release event: 1
  • Issues event: 3
  • Watch event: 10
  • Delete event: 2
  • Issue comment event: 3
  • Push event: 4
  • Pull request event: 2
  • Fork event: 3
Last Year
  • Create event: 2
  • Release event: 1
  • Issues event: 3
  • Watch event: 10
  • Delete event: 2
  • Issue comment event: 3
  • Push event: 4
  • Pull request event: 2
  • Fork event: 3

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 1,153
  • Total Committers: 7
  • Avg Commits per committer: 164.714
  • Development Distribution Score (DDS): 0.67
Top Committers
Name Email Commits
Jonge n****e@w****l 381
louwenjjr j****n@h****m 289
Niek de Jonge n****e@e****l 238
florian-huber 3****r@u****m 109
Niek de Jonge 7****e@u****m 98
florian-huber f****r@h****e 28
Joris Louwen 6****r@u****m 10
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 90
  • Total pull requests: 82
  • Average time to close issues: 6 months
  • Average time to close pull requests: 12 days
  • Total issue authors: 19
  • Total pull request authors: 3
  • Average comments per issue: 2.5
  • Average comments per pull request: 1.04
  • Merged pull requests: 70
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 25 minutes
  • Issue authors: 4
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • niekdejonge (43)
  • florian-huber (16)
  • mapio (6)
  • eeko-kon (5)
  • wjcranda (2)
  • justinjjvanderhooft (2)
  • louwenjjr (2)
  • thierrynieus (2)
  • guikool (1)
  • dmendez92 (1)
  • Javi-Rop (1)
  • Dmente92 (1)
  • jamesjiadazhan (1)
  • anani-a-missinou (1)
  • TOTOZAFY (1)
Pull Request Authors
  • niekdejonge (80)
  • florian-huber (7)
  • mapio (5)
Top Labels
Issue Labels
code structure (6) enhancement (5) user interface (3) storage (2) visual appearance (2) computational performance (2) bug (1) documentation (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,008 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 48
  • Total maintainers: 1
pypi.org: ms2query

Tool to query MS/MS spectra against mass spectral library

  • Versions: 48
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 1,008 Last month
Rankings
Dependent packages count: 10.0%
Stargazers count: 11.3%
Forks count: 11.4%
Downloads: 12.5%
Average: 13.4%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • gensim >=4.0.0
  • h5py <3.0.0
  • matchms >=0.11.0,<=0.13.0
  • ms2deepscore *
  • numpy *
  • pandas >=1.2.5
  • scikit-learn *
  • spec2vec >=0.6.0
  • tensorflow *
.github/workflows/CI_build.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • sonarsource/sonarcloud-github-action master composite
.github/workflows/pypi_publish.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish master composite
environment.yml conda
  • h5py 3.11.0.*
  • matchms 0.26.4.*
  • matplotlib 3.7.2.*
  • ms2deepscore 2.0.0.*
  • numpy 1.24.4.*
  • onnxruntime 1.17.0.*
  • pandas 2.2.2.*
  • pyarrow 16.1.0.*
  • pytest 8.2.2.*
  • pytest-cov 5.0.0.*
  • python 3.9.18.*
  • scikit-learn 1.5.0.*
  • skl2onnx 1.16.0.*
  • spec2vec 0.8.0.*
  • zip