Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: Kohulan
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 544 KB
Statistics
  • Stars: 5
  • Watchers: 3
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created almost 5 years ago · Last pushed over 4 years ago

README.md

Performance of chemical structure string representations for chemical image recognition using transformers

  • The use of molecular string representations for deep learning in chemistry has been steadily increasing in recent years. The complexity of existing string representations, and the difficulty of creating meaningful tokens from them, led to the development of new string representations for chemical structures. This study examines the translation of chemical structure depictions, in the form of bitmap images, into the corresponding molecular string representations. It compares the recently developed DeepSMILES and SELFIES representations with the most commonly used SMILES representation, specifically testing how well transformer models translate image features into each string representation. The SMILES representation exhibits the best overall performance, whereas SELFIES guarantees valid chemical structures. DeepSMILES performs in between SMILES and SELFIES, and InChIs are not appropriate for the learning task. All investigations were performed using publicly available datasets, and the code used to train and evaluate the models has been made publicly available.
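
To illustrate what "creating meaningful tokens" from a string representation involves, here is a minimal sketch of atom-level SMILES tokenization using a regular expression. The pattern is a commonly used community regex and the function name `tokenize_smiles` is hypothetical; neither is taken from this repository, and the tokenizer actually used in the study may differ.

```python
import re
from typing import List

# Assumption: a widely used atom-level SMILES regex, NOT the repository's own tokenizer.
# Bracket atoms (e.g. [NH3+]) and two-letter halogens (Cl, Br) are kept as single tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Reject inputs containing characters the pattern does not cover,
    # so that tokenization is always lossless.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES string: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin; each atom, bond, branch symbol, and ring-closure digit becomes one token.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

A character-level split would break `Cl` into `C` and `l`, which is one reason token design matters when training a sequence model on these strings.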

Usage

  • To use the scripts available here, clone the repository to your local disk and work with it from there.
  • The datasets are available on Zenodo as SMILES; you can use the provided SMILES Depictor Java code to generate the image files.
  • We recommend using DECIMER inside a Conda environment to simplify the installation of the dependencies.
  • Conda can be downloaded as part of the Anaconda or Miniconda platforms (Python 3.7). We recommend installing Miniconda3. On Linux you can get it with:
      $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      $ bash Miniconda3-latest-Linux-x86_64.sh

For more on using the model for training and evaluation, please refer to our DECIMER Image Transformer repository.

License

  • This project is licensed under the MIT License - see the LICENSE file for details

Citation

  • Rajan K, Steinbeck C, Zielesny A. Performance of chemical structure string representations for chemical image recognition using transformers. ChemRxiv. Cambridge: Cambridge Open Engage; 2021; This content is a preprint and has not been peer-reviewed.

Acknowledgement

  • We are grateful to @Google for making free computing time on their TensorFlow Research Cloud infrastructure available to us.

Author: Kohulan

Project Website: DECIMER

Research Group

Owner

  • Name: Kohulan Rajan
  • Login: Kohulan
  • Kind: user
  • Location: Jena, Germany
  • Company: Friedrich-Schiller-University

PostDoc @Steinbeck-Lab. Currently based at Friedrich-Schiller-University, Jena.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite both the article from preferred-citation and the software itself."
title: "Performance of chemical structure string representations for chemical image recognition using transformers"
abstract: "The use of molecular string representations for deep learning in chemistry has been steadily increasing in recent years. The complexity of existing string representations, and the difficulty in creating meaningful tokens from them, lead to the development of new string representations for chemical structures. In this study, the translation of chemical structure depictions in the form of bitmap images to corresponding molecular string representations was examined. An analysis of the recently developed DeepSMILES and SELFIES representations in comparison with the most commonly used SMILES representation is presented where the ability to translate image features into string representations with transformer models was specifically tested. The SMILES representation exhibits the best overall performance whereas SELFIES guarantee valid chemical structures. DeepSMILES perform in between SMILES and SELFIES, InChIs are not appropriate for the learning task. All investigations were performed using publicly available datasets and the code used to train and evaluate the models has been made available to the public.
"
authors:
  - family-names: "Rajan"
    given-names: "Kohulan"
    orcid: "https://orcid.org/0000-0003-1066-7792"
  - family-names: "Steinbeck"
    given-names: "Christoph"
    orcid: "https://orcid.org/0000-0001-6966-0814"   
  - family-names: "Zielesny"
    given-names: "Achim"
    orcid: "https://orcid.org/0000-0003-0722-4229"
version: 1.0
date-released: "2021-09-17"
identifiers:
  - description: "This is the scientific publication which describes the software"
    type: doi
    value: "n/a"
  - description: "This is the archived snapshot"
    type: doi
    value: "10.5281/zenodo.5513452"
  - description: "Data archive"
    type: doi
    value: "10.5281/zenodo.5155037"
license: MIT
repository-code: "https://github.com/Kohulan/DECIMER_Short_Communication"

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0