https://github.com/ammar257ammar/swat4hcls2022-chembl-bioschemas-mapping

A hackathon project that aims to map the ChEMBL RDF small molecules, proteins, and taxons onto Bioschemas.org entities and produce the corresponding JSON-LD.

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: zenodo.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (9.5%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: ammar257ammar
License: gpl-3.0
Language: Jupyter Notebook
Default Branch: master
Size: 1.89 MB

Statistics

Stars: 1
Watchers: 1
Forks: 2
Open Issues: 1
Releases: 0

Created over 4 years ago · Last pushed over 3 years ago

Metadata Files

Readme License

SWAT4HCLS Hackathon Bioschemas Project

During the SWAT4HCLS hackathon (Jan 10-13, 2022), I worked on a project aimed at providing the ChEMBL database in JSON-LD format according to the bioschemas.org vocabulary.

Researchers who participated in this project:
- Ammar Ammar (https://orcid.org/0000-0002-8399-8990)
- Alasdair Gray (https://orcid.org/0000-0003-1460-8327)
- François Belleau (https://orcid.org/0000-0002-9816-1093)

The project focused on mapping ChEMBL data onto 3 types of entities from the Bioschemas vocabulary, covering small molecules, proteins, and taxons.

The approach uses the ChEMBL mirror SPARQL endpoint (v28) hosted by the Department of Bioinformatics at Maastricht University (BiGCaT) to construct new RDF (following the Bioschemas vocabulary) from the ChEMBL RDF. ChEMBL entities and predicates were mapped to their Bioschemas counterparts using SPARQL queries, according to the following figures.
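As a rough illustration of this approach, the sketch below builds a paginated SPARQL CONSTRUCT query in Python. It is not one of the project's queries (those live in the "queries" folder). The `cco:chemblId`, `rdfs:label`, `schema:identifier` and `schema:name` predicates, the `bioschemas:` namespace, and the helper name `build_molecule_query` are all assumptions chosen for illustration; the actual mapping is defined by the figures.

```python
# Illustrative sketch only: the predicates and namespaces below are assumptions,
# not the project's actual mapping.
CONSTRUCT_TEMPLATE = """
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
PREFIX bioschemas: <https://bioschemas.org/>

CONSTRUCT {{
  ?molecule a bioschemas:MolecularEntity ;
            schema:identifier ?chemblId ;
            schema:name ?label .
}}
WHERE {{
  {{
    SELECT ?molecule ?chemblId ?label WHERE {{
      ?molecule a cco:SmallMolecule ;
                cco:chemblId ?chemblId ;
                rdfs:label ?label .
    }}
    ORDER BY ?molecule
    LIMIT {limit} OFFSET {offset}
  }}
}}
"""


def build_molecule_query(limit: int, offset: int) -> str:
    """Return a CONSTRUCT query for one page of small molecules (hypothetical helper)."""
    if limit <= 0 or offset < 0:
        raise ValueError("limit must be positive and offset non-negative")
    # The doubled braces in the template become literal SPARQL braces here.
    return CONSTRUCT_TEMPLATE.format(limit=limit, offset=offset)
```

The LIMIT and OFFSET sit in a subquery so that paging applies to the selected molecules rather than to the constructed triples.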

Mapping ChEMBL "SmallMolecule" to Bioschemas "MolecularEntity"

Mapping ChEMBL "SingleProtein" to Bioschemas "Protein"

Implementation

The SPARQL queries used for the mapping are available in the "queries" folder.
The mapping was implemented using Python, Jupyter Notebook and the SPARQLWrapper package. The notebook "ETL.ipynb" contains the code for mapping the ChEMBL RDF to Bioschemas and serializing the results into JSON-LD format. The construction of the molecular entities was performed in batches (100k molecules in each batch).
The process took ~4.5 hours on a personal laptop (Core i7 CPU, 16 GB RAM).
Number of mapped molecules: 1,920,028 (~1.9 million)
Number of mapped proteins: 8,525
Size of the output JSON-LD: 2.68 GB unzipped (380 MB zipped)
The following figure shows an overview of the implementation

NOTE: you can download the JSON-LD produced by this project from the releases tab.
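The batched construction described in the Implementation section can be sketched roughly as follows. This is a minimal sketch, not the code from "ETL.ipynb": the function names and the `endpoint_url` parameter are hypothetical, and the second function assumes that SPARQLWrapper and rdflib (version 6 or later, which bundles JSON-LD support) are installed.

```python
from typing import Iterator, Tuple

BATCH_SIZE = 100_000  # batch size stated in the Implementation section


def batch_offsets(total: int, batch_size: int = BATCH_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs that together cover `total` rows."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be non-negative and batch_size positive")
    for offset in range(0, total, batch_size):
        # The last batch may be smaller than batch_size.
        yield offset, min(batch_size, total - offset)


def fetch_batch_as_jsonld(endpoint_url: str, query: str) -> str:
    """Run one CONSTRUCT query and return the result serialized as JSON-LD.

    Assumption: `endpoint_url` is supplied by the caller; the address of the
    ChEMBL mirror is not given here.
    """
    # Imported lazily so batch_offsets stays usable without these packages.
    from SPARQLWrapper import SPARQLWrapper, RDFXML

    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(RDFXML)
    graph = sparql.query().convert()  # an rdflib graph for CONSTRUCT results
    return graph.serialize(format="json-ld")
```

With the reported 1,920,028 molecules and 100k molecules per batch, `batch_offsets(1920028)` yields 20 batches, the last one holding 20,028 molecules.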

Owner

Name: Ammar Ammar
Login: ammar257ammar
Kind: user
Location: The Netherlands
Company: Maastricht University

Repositories: 14
Profile: https://github.com/ammar257ammar
