https://github.com/arvkevi/generif2vec

Doc2vec model trained on NCBI Gene Reference into Function (GeneRIF)-annotated PubMed abstracts.


Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Keywords

doc2vec genomics nlp
Last synced: 5 months ago

Repository

Doc2vec model trained on NCBI Gene Reference into Function (GeneRIF)-annotated PubMed abstracts.

Basic Info
  • Host: GitHub
  • Owner: arvkevi
  • Language: Python
  • Default Branch: master
  • Size: 700 KB
Statistics
  • Stars: 2
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
doc2vec genomics nlp
Created about 6 years ago · Last pushed about 6 years ago
Metadata Files
Readme

README.md

generif2vec

Resources and tools to work with Doc2vec and NCBI's gene reference into function (GeneRIF).

(Image: UMAP projection of gene document vectors)

GeneRIF Abstracts

GeneRIF is a resource that includes gene annotations curated by genetics experts. The file used for training the Doc2vec model and for linking PubMed ID to Gene ID is hosted by NCBI.
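As an illustration of how the NCBI file links PubMed IDs to Gene IDs, here is a minimal parsing sketch. It is not the package's actual code; the tab-separated column layout (tax_id, Gene ID, PubMed ID(s), timestamp, GeneRIF text) is assumed from the GeneRIF file format, and the function name is hypothetical.

```python
import csv
import io

def pubmed_to_gene(handle):
    """Build a PubMed ID -> set of Gene IDs mapping from a generifs_basic-style file."""
    mapping = {}
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if row[0].startswith("#"):  # skip the header line
            continue
        gene_id, pmids = row[1], row[2]
        # a single GeneRIF row can list several comma-separated PubMed IDs
        for pmid in pmids.split(","):
            mapping.setdefault(pmid, set()).add(gene_id)
    return mapping

sample = "9606\t6772\t10022904\t2010-01-21 00:00\tSTAT1 mediates interferon signaling.\n"
print(pubmed_to_gene(io.StringIO(sample)))  # {'10022904': {'6772'}}
```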

The generif2vec/text module has code to:

  • Download the abstracts for GeneRIF PubMed articles.
  • Process the abstracts from raw text into TaggedDocuments for Doc2vec using spacy.

NOTE: The annotated gene symbol in each abstract was replaced with @gene$
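The masking step above can be sketched roughly as follows. This is an illustrative stand-in for the package's spacy-based pipeline, not its actual code: it uses a plain regex and whitespace split, and returns the (words, tags) pair that gensim's TaggedDocument wraps for training.

```python
import re

def mask_and_tokenize(abstract, gene_symbol):
    """Replace the annotated gene symbol with the @gene$ placeholder and tokenize."""
    masked = re.sub(rf"\b{re.escape(gene_symbol)}\b", "@gene$", abstract)
    tokens = masked.lower().split()  # the real pipeline tokenizes with spacy
    return tokens, [gene_symbol]    # words and the tag used by Doc2vec

words, tags = mask_and_tokenize("STAT1 activates transcription of target genes.", "STAT1")
print(words[0], tags)  # @gene$ ['STAT1']
```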

Pretrained Doc2vec model

Download the pretrained 100-dimension Doc2vec model here and use it as shown below.

The parameter settings for this model were chosen after hyperparameter tuning was performed on Google Cloud AI Platform following the methods described in trainer/README.

This model achieved 59.1% top 5 accuracy and 64.1% top 10 accuracy on a held-out test set (25% of abstracts). More information regarding the evaluation of the model can be found in evaluation.

Installation

Installing the package makes it easy to work with your own text.

```shell
git clone https://github.com/arvkevi/generif2vec.git
cd generif2vec
python setup.py install
```

Library Usage

Load the downloaded model as a Doc2vec model:

```python
from gensim.models.doc2vec import Doc2Vec

generif2vec_model = '/path/to/downloaded_model'
model = Doc2Vec.load(generif2vec_model)
```

Process sample text:
From unseen text, a summary description of STAT1 from UniProt:

```python
stat1_uniprot = """
Signal transducer and transcription activator that mediates cellular responses
to interferons, cytokine KITLG/SCF and other cytokines and other growth factors.
Following type I IFN (IFN-alpha and IFN-beta) binding to cell surface receptors,
signaling via protein kinases leads to activation of Jak kinases and to tyrosine
phosphorylation of STAT1 and STAT2. The phosphorylated @genes dimerize and
associate with ISGF3G/IRF-9 to form a complex termed ISGF3 transcription factor,
that enters the nucleus.
"""
```

Process the text:

```python
from generif2vec.text.util import tokenize

test_tokens = tokenize([stat1_uniprot])
```

Explore the most similar results:

```python
test_vec = model.infer_vector(test_tokens[0])
model.docvecs.most_similar([test_vec], topn=10)
```

```
[('STAT2', 0.6756633520126343),
 ('STAT1', 0.6652628183364868),
 ('IRF9', 0.59737229347229),
 ('JAK1', 0.5927700996398926),
 ('IKBKB', 0.5757358074188232),
 ('TYK2', 0.567344069480896),
 ('IKBKE', 0.565180778503418),
 ('IFNB1', 0.5631940364837646),
 ('STAT3', 0.5513948798179626),
 ('IFNAR2', 0.5507417917251587)]
```

Command Line Usage

There are currently two commands: train-models and similar-genes.

Train the Doc2vec models with different parameters:

```bash
generif2vec train-models -- --help
```

Predict the most similar gene for a directory of text files:

```bash
generif2vec similar-genes -- --help
```

Abstract statistics

Number of unique genes with at least 10 PubMed IDs: 7704
Number of abstracts with genes having at least 10 PubMed IDs: 731,297

Evaluation

The trainer module will evaluate document embeddings by default. Evaluation of the document embeddings was performed by splitting the abstracts into training and test data stratified by gene symbol for equal proportions of gene symbols in each set. The model was trained on the training data set and evaluated on the test data set.
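The stratified split described above can be sketched with scikit-learn (which is in the project's requirements). This is an assumption about how the split could be done, not the trainer module's actual code; the variable names and toy data are illustrative.

```python
from sklearn.model_selection import train_test_split

# Toy parallel lists of abstracts and their gene-symbol labels.
abstracts = ["text a", "text b", "text c", "text d"] * 5
genes = ["STAT1", "STAT2", "STAT1", "STAT2"] * 5

# stratify=genes keeps the proportion of each gene symbol equal
# across the 75% training set and 25% test set.
X_train, X_test, y_train, y_test = train_test_split(
    abstracts, genes, test_size=0.25, stratify=genes, random_state=42
)
print(len(X_train), len(X_test))  # 15 5
```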

| Top 5 Accuracy | Top 10 Accuracy | Median Gene Rank | Median Similarity Difference |
| -------------- | --------------- | ---------------- | ---------------------------- |
| 59.1%          | 64.9%           | 4                | 0.049                        |

  • top_k accuracy: The percentage of abstracts in the test set where the true gene label was in the top k most similar predicted gene labels. (Top 5 of 7704 = 0.06% of all gene labels)
  • median gene rank: The median ranking of the true gene label among all 7704 gene labels in the test set.
  • median similarity difference: The median difference between the top ranked document similarity value (using Doc2vec most_similar) and the document similarity for the true gene label.
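The first two metrics above can be computed as in this sketch, assuming that for each test abstract we have the full ranking of gene labels ordered by document similarity (most similar first). The function names and toy data are illustrative, not from the evaluation module.

```python
import statistics

def top_k_accuracy(rankings, truths, k):
    """Fraction of abstracts whose true gene label appears in the top k predictions."""
    hits = sum(1 for ranked, true in zip(rankings, truths) if true in ranked[:k])
    return hits / len(truths)

def median_gene_rank(rankings, truths):
    """Median 1-based rank of the true gene label (rank 1 = most similar)."""
    return statistics.median(ranked.index(true) + 1 for ranked, true in zip(rankings, truths))

rankings = [["STAT1", "STAT2", "JAK1"], ["JAK1", "STAT1", "STAT2"]]
truths = ["STAT1", "STAT2"]
print(top_k_accuracy(rankings, truths, k=2))  # 0.5
print(median_gene_rank(rankings, truths))     # 2.0
```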

Comparison to Gene Ontology

Use this notebook to compare the results of Doc2vec embeddings to gene similarity using the molecular function branch of the Gene Ontology with GoSemSim.

(Image: STAT1 comparison of Doc2vec similarity and Gene Ontology similarity)

Owner

  • Name: Kevin Arvai
  • Login: arvkevi
  • Kind: user
  • Location: Washington, D.C.

Data science & clinical genomics


Dependencies

requirements.txt pypi
  • biopython *
  • biothings_client *
  • cloudml-hypertune *
  • fire *
  • gcsfs *
  • gensim *
  • joblib *
  • loguru *
  • numpy *
  • pandas *
  • python-dateutil ==2.8
  • scikit-learn *
  • scipy *
  • spacy *
  • streamlit *
  • umap-learn *