corems

CoreMS is a comprehensive mass spectrometry software framework

https://github.com/emsl-computing/corems

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    7 of 10 committers (70.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.2%) to scientific vocabulary

Keywords

complex-mixture data-analysis dissolved-organic-matter mass-spectrometry metabolomics metabolomics-pipeline molecular-database molecular-formulae-assignment molecular-search natural-organic-matter soil-organic-matter
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 59
  • Watchers: 7
  • Forks: 37
  • Open Issues: 2
  • Releases: 3
Created almost 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Support Zenodo

README.md

CoreMS

CoreMS is a comprehensive mass spectrometry framework for software development and data analysis of small molecules.

Data handling and software development for modern mass spectrometry (MS) is an interdisciplinary endeavor requiring skills in computational science and a deep understanding of MS. To enable scientific software development to keep pace with fast improvements in MS technology, we have developed a Python software framework named CoreMS. The goal of the framework is to provide a fundamental, high-level basis for working with all mass spectrometry data types, allowing custom workflows for data signal processing, annotation, and curation. The data structures were designed with an intuitive, mass spectrometric hierarchical structure, allowing organized and easy access to the data and calculations. Moreover, CoreMS supports direct access to almost all vendors' data formats, allowing for the centralization and automation of all data processing workflows from the raw signal to data annotation and curation.

CoreMS aims to provide:

  • logical mass spectrometric data structure
  • self-containing data and metadata storage
  • modern molecular formulae assignment algorithms
  • dynamic molecular search space database search and generator


Current Version

3.8.0


Main Developers/Contact


Documentation

API documentation can be found here.

Overview slides can be found here.


Contributing

As an open source project, CoreMS welcomes contributions of all forms. Before contributing, please see our Dev Guide.


Data formats

Data input formats

  • Bruker Solarix (CompassXtract)
  • Bruker Solarix transients, ser and fid (FT magnitude mode only)
  • ThermoFisher (.raw)
  • Spectroswiss signal booster data-acquisition station (.hdf5)
  • MagLab ICR data-acquisition station (FT and magnitude mode) (.dat)
  • ANDI NetCDF for GC-MS (.cdf)
  • mzml for LC-MS (.mzml)
  • Generic mass list in profile and centroid mode (includes all delimiter types and Excel formats)
  • CoreMS exported processed mass list files (Excel, .csv, .txt, pandas dataframe as .pkl)
  • CoreMS self-containing Hierarchical Data Format (.hdf5)
  • Pandas Dataframe
  • Support for cloud storage using s3path.S3Path

Data output formats

  • Pandas data frame (can be saved using pickle, h5, etc)
  • Text Files (.csv, tab separated .txt, etc)
  • Microsoft Excel (xlsx)
  • Automatic JSON for metadata storage and reuse
  • Self-containing Hierarchical Data Format (.hdf5) including raw data and time-series data-point for processed data-sets with all associated metadata stored as json attributes

Data structure types

  • LC-MS
  • GC-MS
  • Transient
  • Mass Spectra
  • Mass Spectrum
  • Mass Spectral Peak
  • Molecular Formula
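
The containment implied by the list above, together with two behaviours the usage example in this README relies on (a peak is truthy once a formula is assigned, and iterating a peak yields its formulas), can be sketched with plain dataclasses. The class names `Spectrum`, `Peak`, and `Formula`, the ascending sort order, and the sample values are hypothetical stand-ins for illustration, not the CoreMS classes themselves.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Formula:  # hypothetical stand-in for a molecular formula
    string: str
    mz_calc: float


@dataclass
class Peak:  # hypothetical stand-in for a mass spectral peak
    mz_exp: float
    abundance: float
    formulas: List[Formula] = field(default_factory=list)

    def __bool__(self) -> bool:
        # Truthy only when at least one formula has been assigned
        return bool(self.formulas)

    def __iter__(self):
        # Iterating a peak walks its assigned formulas
        return iter(self.formulas)


@dataclass
class Spectrum:  # hypothetical stand-in for a mass spectrum
    peaks: List[Peak] = field(default_factory=list)

    def sort_by_abundance(self) -> List[Peak]:
        # Ascending order is an assumption made for this sketch
        return sorted(self.peaks, key=lambda p: p.abundance)


spectrum = Spectrum([
    Peak(255.2330, 1.0e6, [Formula("C16 H32 O2", 255.2330)]),
    Peak(300.1000, 2.5e5),  # no formula assigned
])
assigned = [p.mz_exp for p in spectrum.sort_by_abundance() if p]
print(assigned)
```

Keeping the formulas inside the peak, and the peaks inside the spectrum, is what lets a single loop over the spectrum reach every annotation without a separate lookup table.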

Available features

FT-MS Signal Processing, Calibration, and Molecular Formula Search and Assignment

  • Apodization, Zerofilling, and Magnitude mode FT
  • Manual and automatic noise threshold calculation
  • Peak picking using apex quadratic fitting
  • Experimental resolving power calculation
  • Frequency and m/z domain calibration functions:
    • Ledford equation
    • Linear equation
    • Quadratic equation
  • Automatic search for the most abundant Ox homologue series
  • Automatic local (SQLite) or external (PostgreSQL) database check, generation, and search
  • Automatic molecular formulae assignments algorithm for ESI(-) MS for natural organic matter analysis
  • Automatic fine isotopic structure calculation and search for all isotopes
  • Flexible Kendrick normalization base
  • Kendrick filter using density-based clustering
  • Kendrick classification
  • Heteroatoms classification and visualization
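
As an illustration of what a flexible Kendrick normalization base means, the sketch below rescales m/z so that a chosen base unit (CH2 by default) has an integer mass, then takes the mass defect. The function names are hypothetical, and the sign convention (nominal Kendrick mass minus Kendrick mass) is one common choice; the convention used inside CoreMS may differ.

```python
CH2_EXACT = 14.01565006  # exact mass of CH2: 12.000000 + 2 * 1.00782503
CH2_NOMINAL = 14


def kendrick_mass(mz, base_exact=CH2_EXACT, base_nominal=CH2_NOMINAL):
    """Rescale m/z so that the chosen base unit has an integer mass."""
    return mz * base_nominal / base_exact


def kendrick_mass_defect(mz, base_exact=CH2_EXACT, base_nominal=CH2_NOMINAL):
    """Nominal Kendrick mass minus Kendrick mass (one common sign convention)."""
    km = kendrick_mass(mz, base_exact, base_nominal)
    return round(km) - km


# Two members of a CH2 homologous series share the same Kendrick mass defect
a = 255.23295
b = a + CH2_EXACT
print(kendrick_mass_defect(a), kendrick_mass_defect(b))
```

Passing a different `base_exact` and `base_nominal` pair switches the base, which is why members of the corresponding homologous series line up at a constant defect.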

GC-MS Signal Processing, Calibration, and Compound Identification

  • Baseline detection, subtraction, smoothing
  • m/z based Chromatogram Peak Deconvolution
  • Manual and automatic noise threshold calculation
  • First and second derivatives peak picking methods
  • Peak Area Calculation
  • Retention Index Calibration
  • Automatic local (SQLite) or external (MongoDB or PostgreSQL) database check, generation, and search
  • Automatic molecular match algorithm with all spectral similarity methods
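
To make the idea of a spectral similarity method concrete, here is a minimal sketch of one widely used score, cosine similarity, applied to two centroided spectra binned to nominal m/z. It is an illustration only: the function names are hypothetical, and it does not claim to reproduce the scoring or weighting that CoreMS applies.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks):
    """Sum intensities of (mz, intensity) pairs into nominal (integer) m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz)] += intensity
    return binned


def cosine_similarity(query, reference):
    """Cosine of the angle between two binned intensity vectors (0 to 1)."""
    q, r = bin_spectrum(query), bin_spectrum(reference)
    dot = sum(q[mz] * r[mz] for mz in q.keys() & r.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_r = math.sqrt(sum(v * v for v in r.values()))
    if norm_q == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_q * norm_r)


query = [(73.0, 100.0), (147.1, 40.0), (217.1, 20.0)]
scaled = [(mz, 2 * i) for mz, i in query]       # same pattern, twice as intense
unrelated = [(91.1, 100.0), (65.0, 15.0)]       # no shared bins
print(cosine_similarity(query, scaled), cosine_similarity(query, unrelated))
```

Because the score is normalized by both vector lengths, it depends on the relative fragment pattern and not on absolute signal intensity.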

High Resolution Mass Spectrum Simulations

  • Peak shape (Lorentz, Gaussian, Voigt, and pseudo-Voigt)
  • Peak fitting for peak shape definition
  • Peak position as a function of data points, signal to noise and resolving power (Lorentz and Gaussian)
  • Prediction of mass error distribution
  • Calculated ICR Resolving Power based on magnetic field (B) and transient time (T)
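
The peak shapes named above can be written down directly from their textbook definitions. The sketch below uses unit-height Lorentzian and Gaussian profiles parameterized by full width at half maximum (FWHM), derives the FWHM from resolving power as m/z divided by R, and mixes the two linearly for a pseudo-Voigt. The function names and the unit-height normalization are assumptions made for illustration, not the CoreMS implementation.

```python
import math


def fwhm_from_resolving_power(mz, resolving_power):
    """FWHM of a peak at m/z for resolving power R = mz / FWHM."""
    return mz / resolving_power


def lorentzian(x, center, fwhm):
    """Unit-height Lorentzian with the given full width at half maximum."""
    hwhm = fwhm / 2
    return hwhm**2 / ((x - center) ** 2 + hwhm**2)


def gaussian(x, center, fwhm):
    """Unit-height Gaussian with the given full width at half maximum."""
    sigma = fwhm / (2 * math.sqrt(2 * math.log(2)))
    return math.exp(-((x - center) ** 2) / (2 * sigma**2))


def pseudo_voigt(x, center, fwhm, eta):
    """Linear mix: eta * Lorentzian + (1 - eta) * Gaussian, with 0 <= eta <= 1."""
    return eta * lorentzian(x, center, fwhm) + (1 - eta) * gaussian(x, center, fwhm)


fwhm = fwhm_from_resolving_power(400.0, 200_000)
half = pseudo_voigt(400.0 + fwhm / 2, 400.0, fwhm, eta=0.5)
print(fwhm, half)
```

Both components fall to half height at half the FWHM from the center, so the mixture does too regardless of `eta`; only the tails change.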

LC-MS Signal Processing, Molecular Formula Search and Assignment, and Spectral Similarity Searches

See the walkthrough in this notebook.

  • Two-dimensional (m/z and retention time) peak picking using persistent homology
  • Smoothing, centroid detection, and integration of extracted ion chromatograms
  • Peak shape metric calculations including half peak height, tailing factor, and dispersity index
  • MS1 deconvolution of mass features
  • Identification of 13C isotopes within the mass features
  • Compatibility with molecular formula searching on MS1 or MS2 spectra
  • Spectral search capability using entropy similarity
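
One of the steps listed above, identification of 13C isotopes, rests on a fixed mass spacing: replacing one 12C with 13C shifts a peak by about 1.0033548 Da divided by the charge. The sketch below pairs peaks on that spacing within a ppm tolerance. The function name, the default tolerance, and the sample m/z values are hypothetical, and the real workflow presumably also checks retention time and intensity ratios, which this sketch ignores.

```python
C13_C12_DELTA = 1.0033548  # mass difference between 13C and 12C in Da


def find_13c_pairs(mz_values, charge=1, tol_ppm=5.0):
    """Return (mono_mz, iso_mz) pairs separated by one 13C/12C substitution."""
    spacing = C13_C12_DELTA / charge
    sorted_mz = sorted(mz_values)
    pairs = []
    for i, mono in enumerate(sorted_mz):
        target = mono + spacing
        tol = target * tol_ppm * 1e-6
        for candidate in sorted_mz[i + 1:]:
            if candidate > target + tol:
                break  # sorted input: no later candidate can match
            if abs(candidate - target) <= tol:
                pairs.append((mono, candidate))
    return pairs


peaks = [255.23295, 256.23630, 283.26425, 300.10000]
print(find_13c_pairs(peaks))
```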


Installation

`pip install corems`

By default, the molecular formula database is generated using SQLite.

To use PostgreSQL, the easiest way is to build a Docker container:

`docker-compose up -d`

  • Change the url_database on MSParameters.molecular_search.url_database to: "postgresql+psycopg2://coremsappdb:coremsapppnnl@localhost:5432/coremsapp"
  • Or set the url_database environment variable COREMS_DATABASE_URL to: "postgresql+psycopg2://coremsappdb:coremsapppnnl@localhost:5432/coremsapp"
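
The connection string above follows the SQLAlchemy URL pattern `dialect+driver://user:password@host:port/database`. As a sketch of the environment-variable route, the snippet below assembles that URL from its parts and exports it from Python before CoreMS is imported. The helper `build_postgres_url` is hypothetical, and the variable name `COREMS_DATABASE_URL` is an assumption about the exact spelling; check it against your installed version.

```python
import os
from urllib.parse import quote_plus, urlsplit


def build_postgres_url(user, password, host="localhost", port=5432, database="coremsapp"):
    """Assemble a SQLAlchemy-style URL for the psycopg2 driver (hypothetical helper)."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


url = build_postgres_url("coremsappdb", "coremsapppnnl")

# Assumed variable name; set it before the framework reads its settings
os.environ["COREMS_DATABASE_URL"] = url

# Sanity-check the pieces before handing the URL to anything else
parts = urlsplit(url)
print(parts.hostname, parts.port, parts.path.lstrip("/"))
```

Quoting the user and password keeps the URL valid if the credentials ever contain characters such as `@` or `/`.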

Thermo Raw File Access:

To open Thermo .raw files, pythonnet must be installed:

  • Windows: `pip install pythonnet`
  • Mac and Linux: `brew install mono`, then `pip install pythonnet`

Docker stack

Another way to use CoreMS is to run the Docker stack, which starts the CoreMS containers.

Molecular Database and Jupyter Notebook Docker Containers

A docker container containing:

  • A custom Python distribution with all dependencies installed
  • A Jupyter notebook server with workflow examples
  • A PostgreSQL database for the molecular formulae assignment

If you don't have Docker installed, the easiest way is to install Docker Desktop.

  1. Start the containers using docker-compose (easiest way):

    docker-compose-jupyter.yml contains a volume mapping for the tests_data directory, which holds the data provided for testing. To point it at your own data location:

- locate the volumes on docker-compose-jupyter.yml:

```bash
volumes:
  - ./tests/tests_data:/home/CoreMS/data
```
- change "./tests/tests_data" to your data directory location

```bash
volumes:
  - path_to_your_data_directory:/home/corems/data
```
- save the file and then call:

```bash
docker-compose -f docker-compose-jupyter.yml up
```
  2. Another option is to manually build the containers:
- Build the corems image:
    ```bash
    docker build -t corems:local .
    ```
- Start the database container:
    ```bash
    docker-compose up -d   
    ```
- Start the Jupyter Notebook:
    ```bash
    docker run --rm -v ./data:/home/CoreMS/data corems:local
    ```

- Open your browser, then copy and paste the URL provided in the terminal: `http://localhost:8888/?token=<token>`

- Open the CoreMS-Tutorial.ipynb

Example for FT-ICR Data Processing

More examples can be found under the directories examples/scripts and examples/notebooks.

  • Basic functionality example

```python
from corems.transient.input.brukerSolarix import ReadBrukerSolarix
from corems.molecular_id.search.molecularFormulaSearch import SearchMolecularFormulas
from corems.mass_spectrum.output.export import HighResMassSpecExport
from matplotlib import pyplot

file_path = 'tests/tests_data/ftms/ESI_NEG_SRFA.d'

# Instantiate the Bruker Solarix reader with the file path
bruker_reader = ReadBrukerSolarix(file_path)

# Use the reader to instantiate a transient object
bruker_transient_obj = bruker_reader.get_transient()

# Calculate the transient duration time
T = bruker_transient_obj.transient_time

# Use the transient object to instantiate a mass spectrum object
mass_spectrum_obj = bruker_transient_obj.get_mass_spectrum(plot_result=False, auto_process=True)

# The following SearchMolecularFormulas call does the following:
# - searches monoisotopic molecular formulas for all mass spectral peaks
# - calculates fine isotopic structure based on monoisotopic molecular formulas found and current dynamic range
# - searches molecular formulas of correspondent calculated isotopologues
# - settings are stored at SearchConfig.json and can be changed directly on the file or inside the framework class
SearchMolecularFormulas(mass_spectrum_obj, first_hit=False).run_worker_mass_spectrum()

# Iterate over mass spectral peak objects within the mass_spectrum_obj
for mspeak in mass_spectrum_obj.sort_by_abundance():

    # If there is at least one molecular formula associated, mspeak returns True
    if mspeak:

        # Get the molecular formula with the highest mass accuracy
        molecular_formula = mspeak.molecular_formula_lowest_error

        # Plot mz and peak height
        pyplot.plot(mspeak.mz_exp, mspeak.abundance, 'o', c='g')

        # Iterate over all molecular formulas associated with the ms peak object
        for molecular_formula in mspeak:

            # Check if the molecular formula is an isotopologue
            if molecular_formula.is_isotopologue:

                # Access the molecular formula text representation and print
                print(molecular_formula.string)

                # Get 13C atoms count
                print(molecular_formula['13C'])
    else:
        # Get mz and peak height
        print(mspeak.mz_exp, mspeak.abundance)

# Save data
# to a csv file
mass_spectrum_obj.to_csv("filename")
mass_spectrum_obj.to_hdf("filename")

# to pandas DataFrame pickle
mass_spectrum_obj.to_pandas("filename")

# Extract data as a pandas DataFrame
df = mass_spectrum_obj.to_dataframe()
```
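
The example above asks each peak for its molecular formula "with the highest mass accuracy". As a rough illustration of what that selection involves, the sketch below computes the mass error in parts per million and picks the candidate with the smallest absolute error. The function names, the sign convention (experimental minus calculated), and the candidate values are hypothetical and do not represent CoreMS internals.

```python
def ppm_error(mz_exp, mz_calc):
    """Signed mass error in parts per million (experimental minus calculated)."""
    return (mz_exp - mz_calc) / mz_calc * 1e6


def lowest_error_candidate(mz_exp, candidates):
    """Pick the (label, mz_calc) candidate with the smallest absolute ppm error."""
    return min(candidates, key=lambda c: abs(ppm_error(mz_exp, c[1])))


# Hypothetical candidates for one measured peak
candidates = [("candidate A", 255.23295), ("candidate B", 255.23080)]
best = lowest_error_candidate(255.23301, candidates)
print(best[0], ppm_error(255.23301, best[1]))
```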


UML Diagrams

UML (unified modeling language) diagrams for Direct Infusion FT-MS and GC-MS classes can be found here.


Citing CoreMS

If you use CoreMS in your work, please use the following citation:

Version 3.8.0 Release on GitHub, archived on Zenodo:

Yuri E. Corilo, William R. Kew, Lee Ann McCue, Katherine R. Heal, James C. Carr (2024, October 29). EMSL-Computing/CoreMS: CoreMS 3.0.0 (Version v3.0.0), as developed on Github. Zenodo. http://doi.org/10.5281/zenodo.14009575

```
This material was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the United States Department of Energy, nor Battelle, nor any of their employees, nor any jurisdiction or organization that has cooperated in the development of these materials, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, software, or process disclosed, or represents that its use would not infringe privately owned rights.

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or Battelle Memorial Institute. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

             PACIFIC NORTHWEST NATIONAL LABORATORY
                          operated by
                            BATTELLE
                            for the
               UNITED STATES DEPARTMENT OF ENERGY
                under Contract DE-AC05-76RL01830
```

Owner

  • Name: EMSL Computing
  • Login: EMSL-Computing
  • Kind: organization

GitHub Events

Total
  • Create event: 9
  • Issues event: 6
  • Release event: 8
  • Watch event: 7
  • Delete event: 4
  • Member event: 2
  • Issue comment event: 11
  • Push event: 28
  • Pull request event: 1
  • Fork event: 9
Last Year
  • Create event: 9
  • Issues event: 6
  • Release event: 8
  • Watch event: 7
  • Delete event: 4
  • Member event: 2
  • Issue comment event: 11
  • Push event: 28
  • Pull request event: 1
  • Fork event: 9

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 1,419
  • Total Committers: 10
  • Avg Commits per committer: 141.9
  • Development Distribution Score (DDS): 0.062
Top Committers
| Name | Email | Commits |
|---|---|---|
| Corilo, Yuri | c****o@p****v | 1,331 |
| Will Kew | w****w@p****v | 34 |
| saka326 | j****i@p****v | 20 |
| Thompson, Allison M | a****n@p****v | 17 |
| Smith, Ian M | i****h@p****v | 5 |
| Clendinen | c****n@p****v | 4 |
| Kzra | e****n@g****m | 3 |
| dependabot[bot] | 4****]@u****m | 2 |
| deweycw | c****y@s****u | 2 |
| Yuri Corilo | z****i@g****m | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 8
  • Total pull requests: 15
  • Average time to close issues: 12 days
  • Average time to close pull requests: about 1 month
  • Total issue authors: 5
  • Total pull request authors: 7
  • Average comments per issue: 2.88
  • Average comments per pull request: 0.6
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 3
  • Pull requests: 1
  • Average time to close issues: 27 days
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 4.33
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Kzra (3)
  • deweycw (2)
  • mhhur (1)
  • uby76 (1)
  • 1218594966 (1)
Pull Request Authors
  • Kzra (4)
  • deweycw (4)
  • corilo (2)
  • jmrd98 (2)
  • dependabot[bot] (2)
  • rboiteau (1)
  • GeorgiosDolias (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (2) enhancement (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 156 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 5
  • Total versions: 41
  • Total maintainers: 1
pypi.org: corems

Mass Spectrometry Framework for Small Molecules Analysis

  • Versions: 41
  • Dependent Packages: 0
  • Dependent Repositories: 5
  • Downloads: 156 Last month
Rankings
Dependent repos count: 6.6%
Forks count: 8.1%
Dependent packages count: 10.0%
Stargazers count: 10.4%
Average: 10.6%
Downloads: 17.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • IsoSpecPy ==2.0.1
  • beautifulsoup4 >=4.8.1
  • chardet >=3.0.4
  • h5py >=2.10.0
  • lmfit >=1.0.0
  • lxml >=4.4.1
  • matplotlib >=3.1.1
  • netCDF4 ==1.5.3
  • numpy >=1.17.3,<1.20.0
  • openpyxl >=2.6.3
  • pandas >=0.25.0
  • psutil >=5.6.6
  • psycopg2-binary >=2.8.3
  • pymc3 >=3.8
  • pyswarm *
  • python-dateutil >=2.8.0
  • pywavelets *
  • s3path *
  • scipy >=1.3.0
  • sklearn >=0.0
  • sqlalchemy >=1.4
  • sqlalchemy-utils *
  • tqdm >=4.43.0
  • urllib3 ==1.26.5
  • xlrd >=1.2.0
requirements-dev.txt pypi
  • bumpversion * development
  • memory_profiler * development
  • pylint * development
  • pyprof2calltree * development
  • pytest * development
  • pytest-cov * development
  • twine * development