https://github.com/brianhie/mutational-semantics-neurips2020

Learning mutational semantics

Repository

Basic Info
  • Host: GitHub
  • Owner: brianhie
  • License: MIT
  • Language: Python
  • Default Branch: master
  • Size: 203 MB
Statistics
  • Stars: 8
  • Watchers: 3
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 5 years ago · Last pushed over 5 years ago

README.md

Learning mutational semantics

This repository contains the analysis code, links to the data, and pretrained models for the paper "Learning mutational semantics" by Brian Hie, Ellen Zhong, Bryan Bryson, and Bonnie Berger, which appeared as a poster at NeurIPS 2020.

For a more biologically oriented follow-up, including analysis of SARS-CoV-2 viral sequences, see our paper "Learning the language of viral evolution and escape".

Data

You can download the relevant datasets (including training and validation data) by running the following commands from within the same directory as this repository:

    wget http://cb.csail.mit.edu/cb/viral-mutation/data.tar.gz
    tar xvf data.tar.gz
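Where `wget` is unavailable, the same download-and-extract step can be sketched in Python with the standard library alone. This is only an illustration: the helper names `download` and `extract_archive` are hypothetical and not part of the repository, and the URL is the one given above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the Data section of the README.
DATA_URL = "http://cb.csail.mit.edu/cb/viral-mutation/data.tar.gz"


def download(url: str, dest: Path) -> Path:
    """Fetch url to dest unless dest already exists (hypothetical helper)."""
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_archive(archive: Path, target_dir: Path) -> list:
    """Unpack a (possibly compressed) tar archive and return its member names."""
    with tarfile.open(str(archive)) as tar:
        names = tar.getnames()
        tar.extractall(str(target_dir))
    return names


# Usage, run from the repository root (not executed here):
#   extract_archive(download(DATA_URL, Path("data.tar.gz")), Path("."))
```

Skipping the download when the archive is already present avoids fetching the data again on a rerun.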

Dependencies

The major Python package requirements and their tested versions are in requirements.txt.

Our experiments were run with Python version 3.7 on Ubuntu 18.04.
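Because the tested versions are pinned, it can help to compare an existing environment against them before running anything. The sketch below assumes `requirements.txt` uses `name==version` lines; `parse_pins` and `find_mismatches` are hypothetical helpers, not repository code. It relies on `importlib.metadata`, which is in the standard library from Python 3.8 onward; on Python 3.7 the `importlib_metadata` backport provides the same interface.

```python
from importlib import metadata  # standard library in Python 3.8+


def parse_pins(text: str) -> dict:
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Return {name: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Usage (not executed here):
#   with open("requirements.txt") as handle:
#       print(find_mismatches(parse_pins(handle.read())))
```

An empty result means every pinned package is installed at the tested version.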

Experiments

To run the experiments below, download the data (instructions above). Our experiments require a maximum of 400 GB of CPU RAM and 32 GB of GPU RAM (though often much less); in silico escape model inference can take around 35 minutes for influenza HA and 90 minutes for HIV Env.

News headlines

Headline part-of-speech changes and WordNet changes can be evaluated with the command:

    python bin/parse_headline_mods.py results/headlines/semantics_1024.log.gz \
        > headline_pos.log 2>&1

Generating headline changes can be done with the command:

    python bin/headlines.py bilstm --checkpoint data/headlines.hdf5 --semantics \
        > semantics.log 2>&1 &
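The log passed to `parse_headline_mods.py` is gzip-compressed, so it cannot be opened directly in a text editor. A minimal sketch for peeking at its first lines without unpacking it follows. The helper name `head_gz` is hypothetical, and nothing is assumed about the log's internal format beyond it being text.

```python
import gzip
from itertools import islice


def head_gz(path: str, n: int = 10) -> list:
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Usage (not executed here):
#   for line in head_gz("results/headlines/semantics_1024.log.gz"):
#       print(line)
```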

Influenza HA

Influenza HA semantic embedding UMAPs and log files with statistics can be generated with the command:

    python bin/flu.py bilstm --checkpoint models/flu.hdf5 --embed \
        > flu_embed.log 2>&1

Single-residue escape prediction using validation data from Doud et al. (2018) and Lee et al. (2019) can be done with the command:

    python bin/flu.py bilstm --checkpoint models/flu.hdf5 --semantics \
        > flu_semantics.log 2>&1

Training a new model on flu HA sequences can be done with the command:

    python bin/flu.py bilstm --train --test \
        > flu_train.log 2>&1

HIV Env

HIV Env semantic embedding UMAPs and log files with statistics can be generated with the command:

    python bin/hiv.py bilstm --checkpoint models/hiv.hdf5 --embed \
        > hiv_embed.log 2>&1

Single-residue escape prediction using validation data from Dingens et al. (2019) can be done with the command:

    python bin/hiv.py bilstm --checkpoint models/hiv.hdf5 --semantics \
        > hiv_semantics.log 2>&1

Training a new model on HIV Env sequences can be done with the command:

    python bin/hiv.py bilstm --train --test \
        > hiv_train.log 2>&1
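The pretrained-model commands for influenza HA and HIV Env share one shape: a script, the `bilstm` argument, a checkpoint, and either `--embed` or `--semantics`, with output redirected to a log. A small driver that runs them in sequence might look like the sketch below. The script paths, checkpoint paths, and flags are copied from the commands above; `build_command` and `run_logged` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys

# Script and checkpoint paths as they appear in the commands above.
SCRIPTS = {"flu": "bin/flu.py", "hiv": "bin/hiv.py"}
CHECKPOINTS = {"flu": "models/flu.hdf5", "hiv": "models/hiv.hdf5"}
MODES = ("embed", "semantics")


def build_command(virus: str, mode: str) -> list:
    """Build the argument list for one pretrained-model run."""
    if virus not in SCRIPTS:
        raise ValueError("unknown virus: " + virus)
    if mode not in MODES:
        raise ValueError("unknown mode: " + mode)
    return [sys.executable, SCRIPTS[virus], "bilstm",
            "--checkpoint", CHECKPOINTS[virus], "--" + mode]


def run_logged(command: list, log_path: str) -> int:
    """Run command, sending stdout and stderr to log_path (like '> log 2>&1')."""
    with open(log_path, "w") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# Usage, run from the repository root (not executed here):
#   for virus in SCRIPTS:
#       for mode in MODES:
#           run_logged(build_command(virus, mode), f"{virus}_{mode}.log")
```

Passing the arguments as a list avoids shell quoting issues, and using `sys.executable` keeps the child process on the same interpreter, and therefore the same installed packages, as the driver.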

Questions

For questions about the pipeline and code, contact brianhie@mit.edu. We will do our best to provide support, address any issues, and keep improving this software. And do not hesitate to submit a pull request and contribute!

Owner

  • Name: Brian Hie
  • Login: brianhie
  • Kind: user
  • Location: San Francisco

Dependencies

requirements.txt pypi
  • biopython ==1.76
  • flair ==0.4.5
  • hmmlearn ==0.2.3
  • keras ==2.3.1
  • matplotlib ==3.1.1
  • nltk ==3.4.5
  • numpy ==1.17.2
  • pandas ==0.25.1
  • pattern3 ==3.0.0
  • scanpy ==1.4.5.1
  • scikit-learn ==0.21.3
  • scipy ==1.3.1
  • seaborn ==0.9.0
  • tensorflow-gpu ==2.2.1