massbank2db

Build a local SQLite Database from Massbank.

https://github.com/bachi55/massbank2db

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: bachi55
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 616 KB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 1
  • Open Issues: 4
  • Releases: 1
Created over 5 years ago · Last pushed about 4 years ago

README.md

Build a local SQLite Database from Massbank

MassBank is an open repository for mass spectrometry (MS) data, including tandem MS (MS²). Meta information, such as the ground truth molecular structure, retention time (RT) and instrument setup, is provided for the MassBank records.

The 'massbank2db' package in a nutshell

The massbank2db package parses MassBank records and organizes them into a local SQLite database. Its main focus is to make MassBank's MS data available for the development of machine learning methods, which is achieved through the following functionality:

  • Molecular structure representations, such as SMILES and InChI, can be updated using PubChem by matching the InChIKeys provided in MassBank. This yields consistent and standardized molecule representations for all data points.
  • The MassBank records are grouped by their measurement conditions, that is, the chromatographic and mass spectrometry setup, such that, for example, retention times are comparable within each group.
  • The database creation process enforces complete information on MassBank records and filters out data that could be corrupted. If relevant information, such as the retention time or the ground truth structure annotation, is missing, the record is skipped. Officially deprecated and otherwise "suspicious" records are also filtered out.
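The three steps above can be illustrated with a minimal sketch. It is not the package's implementation: the record layout, the field names (`inchikey`, `retention_time`, `contributor`, `instrument`, `column_name`, `deprecated`) and the dictionary standing in for the PubChem lookup are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical, simplified record layout. Real MassBank records have more
# fields and different names.
Record = Dict[str, object]

REQUIRED_FIELDS = ("inchikey", "retention_time")
CONDITION_FIELDS = ("contributor", "instrument", "column_name")


def is_complete(record: Record) -> bool:
    """Keep a record only if it is not deprecated and no required field is missing."""
    if record.get("deprecated"):
        return False
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def standardize(record: Record, pubchem: Dict[str, str]) -> Record:
    """Return a copy whose SMILES is replaced by the reference SMILES, if known."""
    updated = dict(record)
    smiles = pubchem.get(str(record["inchikey"]))
    if smiles is not None:
        updated["smiles"] = smiles
    return updated


def group_by_conditions(records: List[Record]) -> Dict[Tuple, List[Record]]:
    """Group records that share the same measurement conditions."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(field) for field in CONDITION_FIELDS)
        groups[key].append(record)
    return dict(groups)


def build_groups(records: List[Record], pubchem: Dict[str, str]) -> Dict[Tuple, List[Record]]:
    """Filter incomplete records, standardize structures, then group."""
    kept = [standardize(r, pubchem) for r in records if is_complete(r)]
    return group_by_conditions(kept)
```

Grouping on a tuple of condition fields keeps retention times comparable only if the conditions are reported consistently, which is why unstructured setup descriptions are a problem for this kind of approach.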

Installation

To install the package:

1) Clone the latest version of the package and change to the directory:
```bash
git clone https://github.com/bachi55/massbank2db
cd massbank2db
```

2) (Optional) Create a conda environment:
```bash
conda env create -f environment.yml
conda activate massbank2db
```

3) Install the package:
```bash
# Build the binaries for the spectra merging implementation
python setup.py build_ext --inplace

# Install the package
pip install .
```

4) (Optional) Test the installation:
```bash
python -m unittest discover -s massbank2db/tests -p 'unittests*.py'

# Expected output
...s......................
----------------------------------------------------------------------
Ran 26 tests in 0.150s

OK (skipped=1)
```

Using the package

Current limitations (<= version 0.9.0)

  • The package has been tested with MassBank release 2020.11. Later releases might cause errors in the record grouping.
    • The main reason is that the measurement conditions (specifically for the chromatography) are not reported in a structured and unified style.
    • For the 2020.11 release, a pull-request homogenising the setup description has been merged.
  • The package does not support MS¹ (only) records.
    • The main reason is that the package was initially developed for a specific machine learning question at hand.
    • MS¹ was not relevant for the research question.
    • Adding MS¹ support should not be difficult, but requires some additional thinking.
  • The package requires a local PubChem SQLite DB with a specific structure.
    • Again, that is legacy from the original research context.
    • Such a database can be constructed with the pubchem2sqlite package and its default DB layout.
  • Some meta-information for the dataset (= MassBank groups) cannot be automatically determined.
    • The liquid-chromatography column type, i.e. reversed-phase (RP) or HILIC, must be added manually.
    • The column dead-time can only be estimated when all relevant column information is provided. Otherwise, it must be added manually.
  • The package currently only supports liquid-chromatography (LC) records.
    • Again, that is legacy from the original research context.
    • Support for gas-chromatography could be added, but it would probably require some tables to be redesigned, for example by adding new columns.
    • Specifically the meta-information for the datasets is currently tailored to LC systems.
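As an illustration of why the dead-time estimate depends on complete column information, the sketch below uses a common rule of thumb: the void volume is the geometric column volume times an assumed porosity, and the dead time is that volume divided by the flow rate. This is not necessarily the formula massbank2db uses; the function name, the parameter names and the default porosity of 0.65 are assumptions.

```python
import math
from typing import Optional


def estimate_dead_time(
    length_mm: Optional[float],
    inner_diameter_mm: Optional[float],
    flow_rate_ml_min: Optional[float],
    porosity: float = 0.65,  # assumed rule-of-thumb value, not from the package
) -> Optional[float]:
    """Return a rough dead-time estimate in minutes, or None if input is missing."""
    if length_mm is None or inner_diameter_mm is None or flow_rate_ml_min is None:
        # Without complete column information no estimate is possible;
        # the value would have to be added manually.
        return None
    if flow_rate_ml_min <= 0:
        raise ValueError("flow rate must be positive")

    radius_cm = inner_diameter_mm / 20.0  # mm diameter -> cm radius
    length_cm = length_mm / 10.0
    # Geometric volume in cm^3 (= mL), scaled by the fraction filled with mobile phase
    void_volume_ml = math.pi * radius_cm ** 2 * length_cm * porosity
    return void_volume_ml / flow_rate_ml_min
```

For a hypothetical 150 mm × 2.1 mm column at 0.3 mL/min this gives roughly 1.1 minutes, while a missing flow rate yields `None`.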

Building the MassBank database (example using release 2020.11)

Note: At the moment, you need to download a forked copy of MassBank release 2020.11. The fork implements a small fix in the contributor table, which allows massbank2db to determine all sub-directories in the repository that contain MS data and to associate each sub-directory with the available MassBank record prefixes.

Note 2: The following instructions assume that you have created your own PubChem SQLite DB.

1) Clone the forked MassBank repository at release 2020.11:
```bash
git clone -b 2020.11-branch https://github.com/bachi55/MassBank-data
```

2) (Optional) Activate your conda environment:
```bash
conda activate massbank2db
```

3) (Optional) Run the data generation script with the --help option:
```bash
python generate_massbank_sqlite.py --help

## Expected output ##
usage: generate_massbank_sqlite.py [-h] [--use_pubchem_structure_info] [--pubchem_db_fn PUBCHEM_DB_FN] massbank_repo_dir massbank_db_fn

positional arguments:
  massbank_repo_dir     Directory containing the local MassBank copy. It is assumed to be a clone of the MassBank GitHub repository.
  massbank_db_fn        Output filename of the MassBank SQLite database.

optional arguments:
  -h, --help            show this help message and exit
  --use_pubchem_structure_info
                        Indicating whether a local PubChem copy should be used to update the molecular structure information.
  --pubchem_db_fn PUBCHEM_DB_FN
                        Filename of the local PubChem SQLite database file.
```

4) Build the MassBank DB:
```bash
python generate_massbank_sqlite.py /path/to/MassBank-data /tmp/massbank.sqlite --use_pubchem_structure_info --pubchem_db_fn=/path/to/pubchem.sqlite
```
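Once the build has finished, a quick sanity check is to list the tables in the resulting file and count their rows. The sketch below uses only Python's standard `sqlite3` module and makes no assumption about the table names massbank2db creates; the helper name `count_rows_per_table` is hypothetical, and the path in the usage example is the one from the build command above.

```python
import os
import sqlite3
from typing import Dict


def count_rows_per_table(db_fn: str) -> Dict[str, int]:
    """Return a mapping from table name to row count for an SQLite file."""
    if not os.path.isfile(db_fn):
        # sqlite3.connect() would silently create an empty database otherwise
        raise FileNotFoundError(db_fn)

    conn = sqlite3.connect(db_fn)
    try:
        # User tables are listed in sqlite_master; skip SQLite's internal tables
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()


if __name__ == "__main__":
    for table, n_rows in count_rows_per_table("/tmp/massbank.sqlite").items():
        print(f"{table}: {n_rows} rows")
```

An empty result or unexpectedly small counts would suggest that the filtering described earlier removed more records than intended, or that the repository path was wrong.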

Citing the package

If you use this package, please cite:

```bibtex
@software{massbank2db,
  author = {Bach, Eric},
  month = {1},
  title = {{massbank2db: Build a machine learning ready SQLite database from MassBank.}},
  url = {https://github.com/bachi55/massbank2db},
  version = {0.9.0},
  year = {2022}
}
```

Owner

  • Name: Eric Bach
  • Login: bachi55
  • Kind: user
  • Location: Espoo, Finland
  • Company: Aalto University

Doctoral student in the field of Machine Learning, Bioinformatics and Computational Metabolomics.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Bach"
  given-names: "Eric"
  email: eric.bach@aalto.fi
  affiliation: Aalto University, Espoo, Finland
title: "massbank2db: Build a machine learning ready SQLite database from MassBank."
version: 0.9.0
date-released: 2022-01-18
url: "https://github.com/bachi55/massbank2db"

Dependencies

environment.yml conda
  • numpy
  • pandas
  • python >=3.8
  • scipy
setup.py pypi
  • numpy *
  • pandas *
  • scipy *
  • setuptools >=46.1