https://github.com/brianhie/mutational-semantics-neurips2020
Learning mutational semantics
Repository
Learning mutational semantics
Basic Info
- Host: GitHub
- Owner: brianhie
- License: MIT
- Language: Python
- Default Branch: master
- Size: 203 MB
Statistics
- Stars: 8
- Watchers: 3
- Forks: 2
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Learning mutational semantics
This repository contains the analysis code, links to the data, and pretrained models for the paper "Learning mutational semantics" by Brian Hie, Ellen Zhong, Bryan Bryson, and Bonnie Berger, which appeared as a poster at NeurIPS 2020.
For a more biologically-oriented follow-up work, including analysis of SARS-CoV-2 viral sequences, see our paper "Learning the language of viral evolution and escape".
Data
You can download the relevant datasets (including training and validation data) using the commands
```bash
wget http://cb.csail.mit.edu/cb/viral-mutation/data.tar.gz
tar xvf data.tar.gz
```
within the same directory as this repository.
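If `wget` is not available, the same download-and-unpack step can be sketched with the Python standard library. This is a minimal sketch, not part of the repository: the helper names `download` and `extract` are hypothetical, and only the URL comes from the instructions above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the wget command in this README.
DATA_URL = "http://cb.csail.mit.edu/cb/viral-mutation/data.tar.gz"


def download(url: str, dest: Path) -> Path:
    """Fetch url to dest, skipping the download if dest already exists."""
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive: Path, target_dir: Path) -> list:
    """Unpack a tar archive into target_dir and return the member names."""
    # The default read mode auto-detects gzip compression.
    with tarfile.open(str(archive)) as tar:
        names = tar.getnames()
        tar.extractall(str(target_dir))
    return names
```

Run from the repository root, `extract(download(DATA_URL, Path("data.tar.gz")), Path("."))` should leave the unpacked files in the same place as the two shell commands.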
Dependencies
The major Python package requirements and their tested versions are in requirements.txt.
Our experiments were run with Python version 3.7 on Ubuntu 18.04.
Experiments
To run the experiments below, first download the data (see the instructions above). Our experiments used at most 400 GB of CPU RAM and 32 GB of GPU RAM (though often much less). In silico escape model inference can take around 35 minutes for influenza HA and 90 minutes for HIV Env.
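Before launching a long run, it may help to compare the machine's physical memory against the 400 GB peak quoted above. The sketch below is an illustration only: the helper names are hypothetical, `os.sysconf` is POSIX-specific (it works on Linux, not on Windows), and it assumes "GB" means 10^9 bytes. Since most experiments need much less than the peak, a shortfall is a warning rather than a hard failure.

```python
import os

GB = 10 ** 9  # assumption: the README's "GB" means 10^9 bytes


def total_ram_bytes() -> int:
    """Total physical memory via POSIX sysconf (Linux; not available on Windows)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def ram_shortfall_gb(total_bytes: int, required_gb: float = 400) -> float:
    """GB missing relative to required_gb, or 0.0 if there is enough memory."""
    return max(0.0, required_gb - total_bytes / GB)
```

For example, `ram_shortfall_gb(total_ram_bytes())` reports how far the current machine falls below the quoted peak.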
News headlines
Headline part-of-speech changes and WordNet changes can be evaluated with the command
```bash
python bin/parse_headline_mods.py results/headlines/semantics_1024.log.gz \
    > headline_pos.log 2>&1
```
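The input to this command is a gzip-compressed log. To look at its first few lines without decompressing it on disk, something like the following sketch may be useful. The helper name `head_gz` is hypothetical, and the sketch makes no assumption about the log's line format; it only reads text.

```python
import gzip
from itertools import islice


def head_gz(path: str, n: int = 5) -> list:
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in islice(fh, n)]
```

For example, `head_gz("results/headlines/semantics_1024.log.gz")` returns the first five lines of the log used above.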
Generating headline changes can be done with the command
```bash
python bin/headlines.py bilstm --checkpoint data/headlines.hdf5 --semantics \
    > semantics.log 2>&1 &
```
Influenza HA
Influenza HA semantic embedding UMAPs and log files with statistics can be generated with the command
```bash
python bin/flu.py bilstm --checkpoint models/flu.hdf5 --embed \
    > flu_embed.log 2>&1
```
Single-residue escape prediction using validation data from Doud et al. (2018) and Lee et al. (2019) can be done with the command
```bash
python bin/flu.py bilstm --checkpoint models/flu.hdf5 --semantics \
    > flu_semantics.log 2>&1
```
Training a new model on flu HA sequences can be done with the command
```bash
python bin/flu.py bilstm --train --test \
    > flu_train.log 2>&1
```
HIV Env
HIV Env semantic embedding UMAPs and log files with statistics can be generated with the command
```bash
python bin/hiv.py bilstm --checkpoint models/hiv.hdf5 --embed \
    > hiv_embed.log 2>&1
```
Single-residue escape prediction using validation data from Dingens et al. (2019) can be done with the command
```bash
python bin/hiv.py bilstm --checkpoint models/hiv.hdf5 --semantics \
    > hiv_semantics.log 2>&1
```
Training a new model on HIV Env sequences can be done with the command
```bash
python bin/hiv.py bilstm --train --test \
    > hiv_train.log 2>&1
```
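The experiment commands in this README all share one shape: a script under `bin/`, the `bilstm` model name, an optional `--checkpoint`, one or more mode flags, and a log file that captures both stdout and stderr. To script several runs from Python, a minimal sketch could look like the following. The helpers `build_command` and `run_logged` are hypothetical and not part of the repository; the sketch uses only the flags shown above, and it launches the current interpreter (`sys.executable`) where the README commands use `python`.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(script: str, flags: List[str],
                  checkpoint: Optional[str] = None) -> List[str]:
    """Assemble argv for one README command, e.g. bin/flu.py bilstm ... --embed."""
    cmd = [sys.executable, script, "bilstm"]
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    return cmd + flags


def run_logged(cmd: List[str], log_path: str) -> int:
    """Run cmd with stdout and stderr written to log_path (like `> log 2>&1`)."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode
```

For example, `run_logged(build_command("bin/hiv.py", ["--embed"], "models/hiv.hdf5"), "hiv_embed.log")` corresponds to the HIV Env embedding command, and `build_command("bin/hiv.py", ["--train", "--test"])` to the training command.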
Questions
For questions about the pipeline and code, contact brianhie@mit.edu. We will do our best to provide support, address any issues, and keep improving this software. And do not hesitate to submit a pull request and contribute!
Owner
- Name: Brian Hie
- Login: brianhie
- Kind: user
- Location: San Francisco
- Website: brianhie.com
- Twitter: brianhie
- Repositories: 36
- Profile: https://github.com/brianhie
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2
Dependencies
- biopython ==1.76
- flair ==0.4.5
- hmmlearn ==0.2.3
- keras ==2.3.1
- matplotlib ==3.1.1
- nltk ==3.4.5
- numpy ==1.17.2
- pandas ==0.25.1
- pattern3 ==3.0.0
- scanpy ==1.4.5.1
- scikit-learn ==0.21.3
- scipy ==1.3.1
- seaborn ==0.9.0
- tensorflow-gpu ==2.2.1
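Because every package above is pinned to an exact version, a quick check of the active environment against the pins can catch mismatches before a long run. The sketch below is one possible way to do this and is not part of the repository: `parse_pins`, `installed_version`, and `mismatches` are hypothetical helpers. It relies on `importlib.metadata`, which needs Python 3.8 or later; on the tested Python 3.7 it assumes the `importlib_metadata` backport is installed.

```python
import re
from typing import Callable, Dict, List, Optional

try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7: assumes the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Matches "numpy==1.17.2" as well as list-style "- numpy ==1.17.2".
PIN_RE = re.compile(r"^\s*-?\s*([A-Za-z0-9][A-Za-z0-9_.\-]*)\s*==\s*(\S+)\s*$")


def parse_pins(text: str) -> Dict[str, str]:
    """Parse exact-version pins into {package name: version}; skip other lines."""
    pins = {}
    for line in text.splitlines():
        match = PIN_RE.match(line)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def installed_version(name: str) -> Optional[str]:
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def mismatches(pins: Dict[str, str],
               lookup: Callable[[str], Optional[str]] = installed_version) -> List[str]:
    """Describe each pinned package whose installed version differs or is missing."""
    report = []
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            report.append(f"{name}: wanted {wanted}, found {found or 'not installed'}")
    return report
```

For example, `mismatches(parse_pins(open("requirements.txt").read()))` returns an empty list when the environment matches every pin.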