glycosylationstatistics

Systematic detection of sugar moieties in COCONUT using the Sugar Removal Utility

https://github.com/jonasschaub/glycosylationstatistics

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 38 DOI reference(s) in README
  • Academic publication links
    Links to: acs.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.7%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: JonasSchaub
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 3.64 MB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 3
Created over 5 years ago · Last pushed over 2 years ago

README.md

Description and Analysis of Glycosidic Residues in the Largest Open Natural Products Database

Code for automated, systematic detection of sugar moieties in the COlleCtion of Open Natural prodUcTs (COCONUT) database

Description

This repository contains Java source code for the automated in silico detection and analysis of glycosidic moieties in COCONUT, the largest open natural products database, using the Sugar Removal Utility. The analyses are described in Schaub, J.; Zielesny, A.; Steinbeck, C.; Sorokina, M. Description and Analysis of Glycosidic Residues in the Largest Open Natural Products Database. Biomolecules 2021, 11, 486.
Similar analyses are also carried out on datasets from the ZINC15 database, DrugBank, and ChEMBL.
Python scripts and Jupyter notebooks for curating some of the datasets and for analysing and visualising the results are supplied in this repository as well.

Please note that the code in this repository is primarily meant to show how the glycosylation statistics published in the article cited above were generated, and to allow the same analyses to be reproduced or run on other datasets. It is not intended as standalone software, so conveniences such as a published Maven artifact for straightforward installation are not provided.

The Sugar Removal Utility, however, can be installed as a Maven artifact in a straightforward manner and used in your own scripts and workflows to analyse other datasets this way.

Contents

Source code for glycosylation statistics analysis

The class GlycosylationStatisticsTest is located in the directory /src/test/java/de/unijena/cheminf/deglycosylation/stats/. It is a JUnit test class whose test methods can be run in a script-like fashion to perform the various analyses. Using an IDE such as IntelliJ is recommended. Please note that some directory paths and similar settings will need to be adjusted, and some datasets placed in the /src/test/resources/ directory (see below), before you can run the tests yourself.

The directory /Python_scripts_and_notebooks/ contains a Python script for picking a diverse subset of a larger dataset using the RDKit MaxMin algorithm. For the reported analyses, it was used to reduce the size of the downloaded ZINC "in-vitro" subset while preserving diversity. The directory also contains two Jupyter notebooks that were used to analyse and visualise some of the test results.
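
To illustrate the idea behind MaxMin picking, here is a minimal pure-Python sketch. It is not the script from this repository, which relies on RDKit. Fingerprints are modelled as plain sets of feature indices, and the function names `tanimoto_distance` and `maxmin_pick` as well as the toy data are assumptions made for this example only.

```python
from typing import FrozenSet, List, Sequence


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Return 1 - Tanimoto similarity of two feature sets."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def maxmin_pick(fps: Sequence[FrozenSet[int]], n_pick: int, first: int = 0) -> List[int]:
    """Greedy MaxMin: repeatedly pick the item farthest from the picked set."""
    if not 0 < n_pick <= len(fps):
        raise ValueError("n_pick must be between 1 and len(fps)")
    picked = [first]
    # min_dist[i] = distance from item i to its nearest already picked item
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < n_pick:
        candidates = [i for i in range(len(fps)) if i not in picked]
        # The next pick maximises the minimum distance to everything picked so far
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[nxt]))
    return picked


# Toy "fingerprints" (hypothetical data, not derived from real molecules)
toy_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
]
subset = maxmin_pick(toy_fps, n_pick=2)
print(subset)  # indices of the selected items
```

The greedy loop keeps, for every item, its distance to the nearest already selected item and always adds the item for which that distance is largest. This is why the result stays diverse even when the subset is much smaller than the input.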

Installation

This is a Maven project. To run the described analyses yourself, download or clone the repository, open it as a Maven project in a Maven-supporting IDE (e.g. IntelliJ), and load the pom.xml file. Maven will then take care of installing all dependencies.
To run the COCONUT-analysing tests, a MongoDB instance must be running on your machine with the COCONUT NP database imported into it. The respective MongoDB dump can be downloaded at https://coconut.naturalproducts.net/download.
To run the Python scripts and Jupyter notebooks, installing Anaconda is recommended, as it also eases the installation of required libraries such as the open-source cheminformatics software RDKit.

Required datasets

  • COCONUT: To run the COCONUT-analysing tests, a MongoDB instance must be running on your machine with the COCONUT NP database imported into it. The respective MongoDB dump can be downloaded at https://coconut.naturalproducts.net/download. Please check the credentials for the MongoDB connection in the code and adjust them if needed. One test method also analyses COCONUT in the form of an SDF. This file can also be obtained from the given webpage and needs to be placed in the /src/test/resources/ directory.
  • ZINC15: A list of available ZINC15 subsets can be found here. It is recommended to use the program wget to download the subsets. All subsets were downloaded as SMILES files.
    • ZINC "for-sale": A part of the ZINC "for-sale" subset was downloaded for the published analyses and further reduced in size using the ZINC_for-sale_curation.py script located in the /Python_scripts_and_notebooks/ directory. One test method curates the dataset further. After this is done, the curated dataset needs to be placed in the /src/test/resources/ directory so that other test methods can analyse it.
    • ZINC "in-vitro": One test method curates the dataset. After this is done, the curated dataset needs to be placed in the /src/test/resources/ directory so that other test methods can analyse it.
    • ZINC "biogenic": The ZINC "biogenic" dataset needs to be placed in the /src/test/resources/ directory to be used for the curation of the other datasets.
  • Manually curated review of bacterial natural products sugar moieties: Two of the test methods perform a substructure search in COCONUT for sugar moieties reported in bacterial natural products, as manually curated by Elshahawi et al. This dataset is already supplied in this repository in the /src/test/resources/ directory.
  • ChEMBL: The ChEMBL 28 database is curated in one test method and analysed for glycosidic moieties in another. To run the curation test, the dataset has to be placed in the /src/test/resources/ directory as an SDF. After curation, the curated dataset has to be placed in the same directory.
  • DrugBank: The DrugBank "all structures" dataset is curated in one test method and analysed for glycosidic moieties in another. To run the curation test, the dataset has to be placed in the /src/test/resources/ directory as an SDF. After curation, the curated dataset has to be placed in the same directory.
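
Before running the test methods, it can help to verify that the expected files are in place. The following is a minimal sketch of such a check, not part of this repository. The helper names `check_datasets` and `count_sdf_records` and the file names in the demo are hypothetical; the actual file names expected by the test class have to be taken from the code. Counting records relies on the SDF convention that each record ends with a `$$$$` line.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def check_datasets(resources_dir: Path, file_names: Iterable[str]) -> Dict[str, bool]:
    """Map each expected file name to whether it exists in resources_dir."""
    return {name: (resources_dir / name).is_file() for name in file_names}


def count_sdf_records(sdf_path: Path) -> int:
    """Count the records of an SDF by counting its '$$$$' delimiter lines."""
    with sdf_path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip() == "$$$$")


# Demo on a temporary directory so the sketch runs without any real dataset.
# "example.sdf" and "missing.smi" are placeholder names, not files of this project.
with tempfile.TemporaryDirectory() as tmp:
    resources = Path(tmp)
    (resources / "example.sdf").write_text("mol1\n$$$$\nmol2\n$$$$\n", encoding="utf-8")
    status = check_datasets(resources, ["example.sdf", "missing.smi"])
    n_records = count_sdf_records(resources / "example.sdf")

print(status, n_records)
```

For real use, `resources_dir` would point to the /src/test/resources/ directory of your clone, with the file names your copy of the test class refers to.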

Dependencies

References and useful links

Glycosylation statistics of COCONUT publication
  • Schaub, J., Zielesny, A., Steinbeck, C., Sorokina, M. Description and Analysis of Glycosidic Residues in the Largest Open Natural Products Database. Biomolecules 2021, 11, 486. https://doi.org/10.3390/biom11040486

Sugar Removal Utility
  • Schaub, J., Zielesny, A., Steinbeck, C., Sorokina, M. Too sweet: cheminformatics for deglycosylation in natural products. J Cheminform 12, 67 (2020). https://doi.org/10.1186/s13321-020-00467-y
  • SRU Source code
  • Sugar Removal Web Application
  • Source Code of Web Application

Chemistry Development Kit (CDK)
  • Chemistry Development Kit on GitHub
  • Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen EL. The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J Chem Inform Comput Sci. 2003;43(2):493-500.
  • Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL. Recent Developments of the Chemistry Development Kit (CDK) - An Open-Source Java Library for Chemo- and Bioinformatics. Curr Pharm Des. 2006; 12(17):2111-2120.
  • May JW and Steinbeck C. Efficient ring perception for the Chemistry Development Kit. J. Cheminform. 2014; 6:3.
  • Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluska T, Rojas-Chertó M, Spjuth O, Torrance G, Evelo CT, Guha R, Steinbeck C, The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform. 2017; 9:33.
  • Groovy Cheminformatics with the Chemistry Development Kit

COlleCtion of Open NatUral producTs (COCONUT)
  • COCONUT Online home page
  • Sorokina, M., Merseburger, P., Rajan, K. et al. COCONUT online: Collection of Open Natural Products database. J Cheminform 13, 2 (2021). https://doi.org/10.1186/s13321-020-00478-9
  • Sorokina, M., Steinbeck, C. Review on natural products databases: where to find data in 2020. J Cheminform 12, 20 (2020).

ZINC
  • ZINC15 Homepage
  • Sterling and Irwin, J. Chem. Inf. Model, 2015 http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00559

MongoDB
  • MongoDB homepage
  • Java MongoDB Driver documentation

RDKit
  • RDKit homepage
  • Getting started with the RDKit in Python

DrugBank
  • DrugBank homepage
  • Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, Assempour N, Iynkkaran I, Liu Y, Maciejewski A, Gale N, Wilson A, Chin L, Cummings R, Le D, Pon A, Knox C, Wilson M. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017 Nov 8. doi: 10.1093/nar/gkx1037.

ChEMBL
  • ChEMBL homepage
  • Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017) 'The ChEMBL database in 2017.' Nucleic Acids Res., 45(D1) D945-D954.

Owner

  • Name: Jonas Schaub
  • Login: JonasSchaub
  • Kind: user
  • Location: Jena, Germany
  • Company: Friedrich-Schiller-University

Doctoral candidate of Steinbeck research group for cheminformatics and computational metabolomics. ORCID: 0000-0003-1554-6666

Citation (CITATION.cff)

cff-version: 1.2.0
title: Glycosylation Statistics
version: 1.0.2.0
message: "If you use this software, please cite it as below and also cite the accompanying scientific publication referenced below."
type: software
authors:
  - family-names: "Schaub"
    given-names: "Jonas"
    orcid: "https://orcid.org/0000-0003-1554-6666"
  - family-names: "Zielesny"
    given-names: "Achim"    
    orcid: "https://orcid.org/0000-0003-0722-4229"
  - family-names: "Steinbeck"
    given-names: "Christoph"
    orcid: "https://orcid.org/0000-0001-6966-0814"
  - family-names: "Sorokina"
    given-names: "Maria"
    orcid: "https://orcid.org/0000-0001-9359-7149"
doi: "10.5281/zenodo.7081511"
date-released: 2022-09-15
url: "https://github.com/JonasSchaub/GlycosylationStatistics"
license: MIT
references:
  - authors:
      - family-names: "Schaub"
        given-names: "Jonas"
        orcid: "https://orcid.org/0000-0003-1554-6666"
      - family-names: "Zielesny"
        given-names: "Achim"
        orcid: "https://orcid.org/0000-0003-0722-4229"
      - family-names: "Steinbeck"
        given-names: "Christoph"
        orcid: "https://orcid.org/0000-0001-6966-0814"
      - family-names: "Sorokina"
        given-names: "Maria"
        orcid: "https://orcid.org/0000-0001-9359-7149"
    doi: "10.3390/biom11040486"
    issue: 4
    journal: "Biomolecules"
    scope: "Cite this paper if you want to reference the general concepts of the software."
    title: "Description and Analysis of Glycosidic Residues in the Largest Open Natural Products Database"
    type: article
    volume: 11
    year: 2021

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 6 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (2)
Top Labels
Issue Labels
Pull Request Labels
  • dependencies (2)