https://github.com/alleninstitute/biomolvec
Notebooks and scripts used for the Nautilex Hackathon
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (4.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: AllenInstitute
- Language: HTML
- Default Branch: main
- Homepage: https://alleninstitute.github.io/biomolvec
- Size: 8.29 MB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
Readme.md
Genome-wide nucleotide and amino acid sequences.
1. Download FASTA files of cDNA sequences:
   - `Mus_musculus.GRCm39.cdna.all.fa.gz` was downloaded from Ensembl
   - `Homo_sapiens.GRCh38.cdna.all.fa.gz` was downloaded from Ensembl
2. Extract nucleotide sequences from the .fa files into dataframes (saved as .csv files):
   - `00_get_seqs_mouse.ipynb`
   - `00_get_seqs_human.ipynb`
3. Look up amino acid sequences with gget using the Ensembl IDs:
   - `get_aa_seq_mouse.py`
   - `get_aa_seq_human.py`
4. Merge and clean up the dataframes:
   - `03_inspect_aa_seq_human.ipynb`
   - `04_inspect_aa_seq.ipynb`
The final dataframes are saved as:
- `prot_nuc_seqs_mouse.csv`
- `prot_nuc_seqs_human.csv`
Example row:
| genesymbol | ensgid | enstid | nucseqlength | aaseqlength | nucaaseqratio | chromosome | start | end | strand | nucseq | aaseq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gm20730 | ENSMUSG00.. | ENSMUST00 | 359 | 119.0 | 3.01 | GRCm39:6 | 430 | 4305 | -1.0 | ATGAGGTGC | MRCLAEFLR. |
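
The extraction step can be sketched roughly as follows. This is a minimal illustration and not the code in the notebooks: it assumes the usual Ensembl cDNA header layout (`>transcript_id cdna chromosome:assembly:name:start:end:strand gene:... gene_symbol:... description:...`), and the function names `parse_header`, `read_fasta` and `fasta_to_rows`, as well as the record in `example`, are made up for the example. The output keys mirror the column names in the example row.

```python
import io


def parse_header(header):
    """Split an Ensembl-style cDNA FASTA header into a dict of fields."""
    tokens = header.lstrip(">").split()
    # First token is the versioned transcript ID; drop the ".N" version suffix.
    fields = {"enstid": tokens[0].split(".")[0]}
    for token in tokens[1:]:
        if token.startswith("description:"):
            break  # free text follows; not needed here
        key, sep, value = token.partition(":")
        if not sep:
            continue  # e.g. the bare "cdna" token
        if key in ("chromosome", "scaffold"):
            # Assumed layout: assembly:name:start:end:strand
            assembly, name, start, end, strand = value.split(":")
            fields["chromosome"] = f"{assembly}:{name}"
            fields["start"] = int(start)
            fields["end"] = int(end)
            fields["strand"] = int(strand)
        elif key == "gene":
            fields["ensgid"] = value.split(".")[0]
        elif key == "gene_symbol":
            fields["genesymbol"] = value
    return fields


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(handle):
    """Build one dict per transcript, ready to load into a dataframe."""
    rows = []
    for header, seq in read_fasta(handle):
        row = parse_header(header)
        row["nucseq"] = seq
        row["nucseqlength"] = len(seq)
        rows.append(row)
    return rows


# Made-up record, for illustration only (not a real Ensembl entry).
example = io.StringIO(
    ">ENSMUST00000000001.1 cdna chromosome:GRCm39:6:100:111:-1 "
    "gene:ENSMUSG00000000001.1 gene_biotype:protein_coding "
    "transcript_biotype:protein_coding gene_symbol:Demo1 "
    "description:made-up record [Source:none]\n"
    "ATGAGG\nTGCTAA\n"
)
rows = fasta_to_rows(example)
print(rows[0])
```

For the real downloads, the handle would come from `gzip.open(path, "rt")`, and the resulting list of dicts can be passed to `pandas.DataFrame` before saving as .csv.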
Embedding nucleotide sequences with Nucleotide Transformer
We used Nucleotide Transformer models to embed sequences with a maximum length of 5952 nucleotides; longer sequences were truncated to that length. Specifically, we used the `500M_human_ref` model for human and the `500M_multi_species_v2` model for mouse.
- `04_nuc_nt_emb_mouse.ipynb`
- `05_nuc_nt_emb_human.ipynb`
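
The truncation rule can be sketched as below. This is a minimal sketch under stated assumptions and not the notebooks' code: the text does not say which end of a long sequence is dropped, so keeping the leading 5952 nucleotides is an assumption, and `truncate_sequence` and the demo sequences are hypothetical.

```python
MAX_NUC_LEN = 5952  # maximum sequence length stated in the text


def truncate_sequence(seq, max_len=MAX_NUC_LEN):
    """Return the (possibly shortened) sequence and whether it was cut.

    Assumption: the first `max_len` nucleotides are kept.
    """
    return seq[:max_len], len(seq) > max_len


# Made-up sequences: one below the limit, one above it.
sequences = {
    "short_demo": "ATG" * 10,    # 30 nucleotides
    "long_demo": "ACGT" * 2000,  # 8000 nucleotides
}

truncated = {name: truncate_sequence(seq) for name, seq in sequences.items()}

for name, (seq, was_cut) in truncated.items():
    print(name, len(seq), was_cut)
```

Keeping the truncation flag alongside each sequence makes it easy to check later whether clusters are dominated by truncated transcripts.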
Embedding amino-acid sequences with ESM3 and MMIDAS joint clustering
- See the nautilex-esm repository
Combining results for analysis and visualization
We performed Leiden clustering on the Nucleotide Transformer embeddings (nt-emb) and the ESM3 embeddings (esm3-emb), and computed 2D UMAP projections. The results are merged with those from MMIDAS joint clustering for further visualization.
- `06_make_df.ipynb`
- `07_static_plots.ipynb`
- `08_dynamic_plots.ipynb`
- `09_enrichr_vignettes.ipynb`
Contact:
- Rohan Gala
- Yeganeh Marghi
Owner
- Name: Allen Institute
- Login: AllenInstitute
- Kind: organization
- Location: Seattle, WA
- Website: https://alleninstitute.org
- Repositories: 184
- Profile: https://github.com/AllenInstitute
Please visit http://alleninstitute.github.io/ for more information.
GitHub Events
Total
- Push event: 4
- Fork event: 1
Last Year
- Push event: 4
- Fork event: 1