https://github.com/alleninstitute/biomolvec

Notebooks and scripts used for the Nautilex Hackathon

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (4.3%) to scientific vocabulary

Keywords

foundation-models generative-model genes multimodal
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 0
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 12 months ago · Last pushed 12 months ago
Metadata Files
Readme

Readme.md

Genome-wide nucleotide and amino acid sequences.

1. Download FASTA files for cDNA sequences:

  • Mus_musculus.GRCm39.cdna.all.fa.gz was downloaded from Ensembl
  • Homo_sapiens.GRCh38.cdna.all.fa.gz was downloaded from Ensembl

2. Extract nucleotide sequences from the .fa files into dataframes (saved as .csv files)

  • 00_get_seqs_mouse.ipynb
  • 00_get_seqs_human.ipynb
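
The extraction step can be sketched as follows. This is a minimal illustration and not the notebooks' actual code: it assumes the usual Ensembl cDNA header layout (transcript ID first, then `key:value` tokens such as `gene:` and `gene_symbol:`), and the function names `parse_ensembl_cdna` and `read_cdna_fasta` are hypothetical. It uses only the standard library; the resulting list of dicts can be passed to a dataframe constructor and written to .csv.

```python
import gzip
import io


def _make_row(header, chunks):
    """Build one table row from a FASTA header line and its sequence lines."""
    fields = header.split()
    tags = {}
    for token in fields[1:]:
        # The free-text description comes last and may contain colons,
        # so stop parsing key:value tokens once it starts.
        if token.startswith("description:"):
            break
        key, sep, value = token.partition(":")
        if sep:
            tags[key] = value
    seq = "".join(chunks).upper()
    return {
        "enstid": fields[0],
        "ensgid": tags.get("gene", ""),
        "genesymbol": tags.get("gene_symbol", ""),
        "nucseq": seq,
        "nucseqlength": len(seq),
    }


def parse_ensembl_cdna(handle):
    """Yield one dict per FASTA record read from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_row(header, chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield _make_row(header, chunks)


def read_cdna_fasta(path):
    """Read a gzipped FASTA file (e.g. a *.cdna.all.fa.gz download) into rows."""
    with gzip.open(path, "rt") as fh:
        return list(parse_ensembl_cdna(fh))
```

For example, `read_cdna_fasta("Mus_musculus.GRCm39.cdna.all.fa.gz")` would return one row per transcript, assuming the file is in the working directory.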

3. Use lookups with gget to get amino acid sequences using ensembl_ids

  • get_aa_seq_mouse.py
  • get_aa_seq_human.py
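
A lookup of this kind is usually done in batches. The sketch below shows one way to organize it and is not taken from `get_aa_seq_mouse.py` or `get_aa_seq_human.py`. The helper names (`fasta_list_to_dict`, `fetch_aa_seqs`), the batch size, and the assumption that the lookup returns a flat list alternating `>` header strings and sequence strings are all illustrative. The lookup itself is passed in as a callable, so the batching logic can be run without network access.

```python
def fasta_list_to_dict(lines):
    """Pair each '>' header string with the sequence strings that follow it."""
    seqs, current = {}, None
    for item in lines:
        if item.startswith(">"):
            # Use the first whitespace-separated token of the header as the ID.
            current = item[1:].split()[0]
            seqs[current] = ""
        elif current is not None:
            seqs[current] += item.strip()
    return seqs


def fetch_aa_seqs(ensembl_ids, fetch, batch_size=50):
    """Call `fetch` on successive batches of IDs and collect {id: aa_sequence}."""
    results = {}
    for i in range(0, len(ensembl_ids), batch_size):
        batch = ensembl_ids[i:i + batch_size]
        results.update(fasta_list_to_dict(fetch(batch)))
    return results
```

With gget installed, `fetch` could plausibly be `lambda ids: gget.seq(ids, translate=True)`. Whether the original scripts call gget this way is an assumption.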

4. Merge and clean up the dataframes

  • 03_inspect_aa_seq_human.ipynb
  • 04_inspect_aa_seq.ipynb

The final dataframes are saved as

  • prot_nuc_seqs_mouse.csv
  • prot_nuc_seqs_human.csv

Example row:

|genesymbol|ensgid|enstid|nucseqlength|aaseqlength|nucaaseqratio|chromosome|start|end|strand|nucseq|aaseq|
|---|---|---|---|---|---|---|---|---|---|---|---|
|Gm20730|ENSMUSG00..|ENSMUST00|359|119.0|3.01|GRCm39:6|430|4305|-1.0|ATGAGGTGC|MRCLAEFLR.|
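
The length and ratio columns in the example row are derived from the two sequence columns. A minimal sketch of how the merge step might produce them is given below. It assumes the tables are joined on `enstid`, that transcripts without an amino acid sequence are dropped, and that `nucaaseqratio` is nucleotide length divided by amino acid length. None of these choices is confirmed by the notebooks, and the function name `merge_nuc_aa` is hypothetical.

```python
def merge_nuc_aa(nuc_rows, aa_by_id):
    """Join nucleotide rows with amino acid sequences keyed by transcript ID."""
    merged = []
    for row in nuc_rows:
        aaseq = aa_by_id.get(row["enstid"], "")
        if not aaseq:
            # Assumption: rows with no amino acid sequence are discarded.
            continue
        out = dict(row)
        out["aaseq"] = aaseq
        out["nucseqlength"] = len(row["nucseq"])
        out["aaseqlength"] = len(aaseq)
        # Assumption: the ratio is nucleotides per amino acid (unrounded here).
        out["nucaaseqratio"] = out["nucseqlength"] / out["aaseqlength"]
        merged.append(out)
    return merged
```

A ratio close to 3 is what one would expect when the cDNA consists mostly of coding sequence, since each amino acid is encoded by three nucleotides.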

Embedding nucleotide sequences with Nucleotide Transformer

We used Nucleotide Transformer models to embed sequences with a maximum length of 5952 nucleotides. Sequences longer than 5952 nucleotides were truncated. Specifically, we used the 500M_human_ref and 500M_multi_species_v2 models for human and mouse respectively.

  • 04_nuc_nt_emb_mouse.ipynb
  • 05_nuc_nt_emb_human.ipynb
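
The truncation rule can be illustrated with a short helper. This is a sketch only: it assumes truncation keeps the first 5952 nucleotides (the README does not say which end is kept), and the names `MAX_NUC_LEN` and `truncate_for_embedding` are hypothetical. Tokenization and the model call are left out.

```python
MAX_NUC_LEN = 5952  # maximum sequence length stated in the README


def truncate_for_embedding(seq, max_len=MAX_NUC_LEN):
    """Return (sequence cut to max_len, flag saying whether it was cut)."""
    # Assumption: the start of the sequence is kept and the tail is dropped.
    return seq[:max_len], len(seq) > max_len
```

Keeping the flag alongside the sequence makes it easy to check later how many transcripts were shortened before embedding.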

Embedding amino-acid sequences with ESM3 and MMIDAS joint clustering

Combining results for analysis and visualization

We performed Leiden clustering on the nt-emb and esm3-emb embeddings and computed 2D UMAP projections of each. The results are merged with those from MMIDAS joint clustering for further visualization.

  • 06_make_df.ipynb
  • 07_static_plots.ipynb
  • 08_dynamic_plots.ipynb
  • 09_enrichr_vignettes.ipynb

Contact:

  • Rohan Gala
  • Yeganeh Marghi

Owner

  • Name: Allen Institute
  • Login: AllenInstitute
  • Kind: organization
  • Location: Seattle, WA

Please visit http://alleninstitute.github.io/ for more information.

GitHub Events

Total
  • Push event: 4
  • Fork event: 1
Last Year
  • Push event: 4
  • Fork event: 1