https://github.com/ammar257ammar/smilesvecproteinrepresentation
Repository
Basic Info
- Host: GitHub
- Owner: ammar257ammar
- Default Branch: master
- Size: 15.6 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of hkmztrk/SMILESVecProteinRepresentation
Created over 5 years ago
· Last pushed over 6 years ago
# About SMILESVec based Protein Representation
Here, we represent proteins using their interacting ligands. We use the SMILES representation of ligands and propose SMILESVec, a ligand representation built with the [Word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model by Mikolov et al.
Each SMILES string is divided into overlapping subsequences that we call chemical words. Word2vec then learns a high-dimensional, real-valued vector for each of these chemical words. The SMILES vector is the average of the vectors of its chemical words.
We used the [Gensim](https://radimrehurek.com/gensim/) implementation to build the word embeddings.
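
The chemical-word splitting and averaging described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names `chemical_words` and `smiles_vector`, the default word length `k=8`, the handling of short strings, and the plain-dictionary embedding lookup are all assumptions made for the example.

```python
def chemical_words(smiles, k=8):
    """Split a SMILES string into overlapping k-character subsequences."""
    if len(smiles) <= k:
        # Assumption: strings no longer than k are kept as a single word.
        return [smiles]
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


def smiles_vector(smiles, embeddings, k=8):
    """Average the vectors of the chemical words present in `embeddings`.

    `embeddings` is a hypothetical dict mapping chemical word -> list of floats.
    Returns None when no chemical word has a vector.
    """
    vectors = [embeddings[w] for w in chemical_words(smiles, k) if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy usage with 3-character words and made-up 2-dimensional vectors.
toy_embeddings = {"CCO": [1.0, 0.0], "OCC": [0.0, 1.0]}
print(chemical_words("CCOCC", k=3))
print(smiles_vector("CCOCC", toy_embeddings, k=3))
```

Words without a vector are skipped here; other choices, such as a zero vector for unknown words, are equally possible.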

****************************************************************
## Installation
### Data
The "data" folder contains the input and output files.
The "source code" folder contains the Python source code.
Embedding files are provided [here](https://cmpe.boun.edu.tr/~hakime.ozturk/smilesvec.html).
### Requirements
You'll need to install the following in order to run the code.
* Python 2.7.x or Python 3.x
* numpy
* sklearn
* [chembl_webresource_client](https://github.com/chembl/chembl_webresource_client)
  - for dependency issues:
    - `pip install --force-reinstall gevent==1.2.2`
    - `pip install --force-reinstall greenlet==0.4.12`
* pickle (part of the Python standard library)
To run the code, place an embedding file in the ```utils``` folder inside the source folder.
You can use either ```drug.l8.chembl23.canon.ws20.txt``` or ```drug.l8.pubchem.canon.ws20.txt```.
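
If you want to inspect an embedding file yourself, a loader along these lines may help. It is only a sketch: it assumes the file uses a word2vec-style text layout (one token per line followed by space-separated floats, optionally preceded by a `<count> <dimension>` header line), which this README does not confirm. The function name `load_embeddings` is hypothetical.

```python
def load_embeddings(path):
    """Parse a text embedding file into a dict {token: [floats]}.

    Assumption: word2vec-style text format; not verified against the
    files distributed with this project.
    """
    embeddings = {}
    with open(path, "r") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) <= 2:
                # Skips blank lines and an optional "<count> <dim>" header.
                continue
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings
```

A call such as `load_embeddings("utils/drug.l8.chembl23.canon.ws20.txt")` would then return the lookup table, provided the format assumption holds.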
## Usage
### Get SMILESVec for given SMILES
Given a list of SMILES strings, the script outputs the corresponding SMILESVec vectors.
The following commands run on the ```smiles_sample.txt``` file in the utils folder.
```
python getsmilesvec.py [embedding_file_name]
python getsmilesvec.py drug.l8.chembl23.canon.ws20.txt
```
Output: ```smiles.vec```, a pickle file.
Use ```pickle.load(open("smiles.vec", "rb"))``` to open it.
### Get SMILESVec-based representation for a given protein (UniProt ID)
Given a list of UniProt IDs, the script outputs the corresponding SMILESVec-based protein vectors.
The following commands run on the ```prots_sample.txt``` file in the utils folder.
```
python getligprotvec.py [embedding_file_name]
python getligprotvec.py drug.l8.pubchem.canon.ws20.txt
```
Output: ```prot.vec```, a pickle file.
Use ```pickle.load(open("prot.vec", "rb"))``` or, in Python 3,
```
import pickle

with open('prot.vec', 'rb') as f:
    prots = pickle.load(f, encoding='bytes')
```
to open it.
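
Once the vectors are loaded, two of them can be compared with a similarity measure. The sketch below uses cosine similarity as one common choice; it is not confirmed here as the measure used by the project. It assumes each vector is an equal-length sequence of numbers, which depends on the (undocumented) layout of the pickle file. The function name `cosine_similarity` is introduced for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy usage with made-up 2-dimensional vectors.
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```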
### How to train your own embeddings of SMILES?
Please refer to [README here](https://github.com/hkmztrk/SMILESVecProteinRepresentation/tree/master/source/word2vec) for detailed information and source code.
### SMILESVec-based Protein Similarity for SCOP A-50
```
will be updated
```
**For citation:**
[A novel methodology on distributed representations of proteins using their interacting ligands](https://academic.oup.com/bioinformatics/article/34/13/i295/5045707)
```
@article{Ozturk2018Anovel,
author = {Öztürk, Hakime and Ozkirimli, Elif and Özgür, Arzucan},
title = {A novel methodology on distributed representations of proteins using their interacting ligands},
journal = {Bioinformatics},
volume = {34},
number = {13},
pages = {i295-i303},
year = {2018},
doi = {10.1093/bioinformatics/bty287},
URL = {http://dx.doi.org/10.1093/bioinformatics/bty287}
}
```
Owner
- Name: Ammar Ammar
- Login: ammar257ammar
- Kind: user
- Location: The Netherlands
- Company: Maastricht University
- Repositories: 14
- Profile: https://github.com/ammar257ammar