https://github.com/kundajelab/mpra-dragonn
Code accompanying the paper "Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays"
Statistics
- Stars: 11
- Watchers: 7
- Forks: 4
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
MPRA-DragoNN: Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays
This project applies convolutional neural networks to predict output from massively parallel reporter assays (MPRAs), with the aim of systematically decoding regulatory sequence patterns and identifying noncoding variants that may affect gene expression.
Data
This project uses the Sharpr-MPRA dataset from Ernst et al. 2016 (https://www.nature.com/articles/nbt.3678). The raw data were minimally processed, as described in the paper and below. The prepared training, validation, and testing datasets in hdf5 format can be downloaded from here.
Technical details:
The raw data for each of the Sharpr-MPRA experiments were downloaded from the Gene Expression Omnibus, accession number GSE71279. These raw files are hosted at the following Dropbox link.
The raw counts from the experiments were processed by (1) computing log2((RNA + 1) / (DNA + 1)) for each 145 bp sequence in each of the 12 tasks (described below); (2) z-score normalizing the log fold-changes column-wise (i.e., each task's output values have mean 0 and variance 1); and (3) including the reverse complement of each sequence as another datapoint with the same activity values.
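The three processing steps can be sketched in plain Python as follows. This is a minimal illustration, not the repository's preprocessing code: the function names are hypothetical, and the use of the population standard deviation for the z-score is an assumption.

```python
import math
from statistics import fmean, pstdev

def log2_fold_change(rna_count, dna_count):
    """Step 1: log2((RNA + 1) / (DNA + 1)), with a pseudocount of 1 on each count."""
    return math.log2((rna_count + 1) / (dna_count + 1))

def zscore(values):
    """Step 2: rescale one task's column to mean 0 and variance 1.

    Assumption: population standard deviation (divide by N, not N - 1).
    """
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def augment_with_reverse_complements(sequences, labels):
    """Step 3: add each sequence's reverse complement with identical labels."""
    aug_seqs = list(sequences) + [reverse_complement(s) for s in sequences]
    aug_labels = list(labels) + list(labels)
    return aug_seqs, aug_labels
```

Because every reverse complement reuses its original's labels, the augmentation doubles the dataset without changing any task's mean or population variance.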
Each datapoint was then converted to a 145x4 NumPy array corresponding to the one-hot encoding of the sequence's ACGT representation. The label for each sequence was a length-12 array containing the normalized activity values for the 12 tasks. The data was split as follows: sequences on chr8 for validation (~30K), chr18 for testing (~20K), and the remaining chromosomes for training (~900K); the resulting hdf5 files are the ones at the data link.
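As a rough sketch of the encoding and the chromosome-based split described above, the snippet below uses nested lists so that it runs without dependencies; in practice the result would be wrapped in a NumPy array. The A, C, G, T column order, the all-zero encoding for other symbols, and the `(chrom, sequence, labels)` record layout are assumptions made for illustration.

```python
BASES = "ACGT"  # assumed column order of the one-hot encoding

def one_hot_encode(seq):
    """Return a len(seq) x 4 nested list with one row per base."""
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in BASES:
            row[BASES.index(base)] = 1
        # Assumption: any other symbol (e.g. N) is left as an all-zero row.
        rows.append(row)
    return rows

def split_by_chromosome(records, valid_chrom="chr8", test_chrom="chr18"):
    """Assign records to train/valid/test by chromosome.

    records: iterable of (chrom, sequence, labels) tuples (hypothetical layout).
    """
    splits = {"train": [], "valid": [], "test": []}
    for record in records:
        chrom = record[0]
        if chrom == valid_chrom:
            splits["valid"].append(record)
        elif chrom == test_chrom:
            splits["test"].append(record)
        else:
            splits["train"].append(record)
    return splits
```

Splitting by whole chromosome, rather than at random, keeps overlapping or neighboring sequences from appearing in both the training and held-out sets.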
Task description:
- "k562minprep1": K562 cell line, minimal promoter, replicate 1
- "k562minprep2": K562 cell line, minimal promoter, replicate 2
- "k562minpavg": K562 cell line, minimal promoter, average*
- "k562sv40prep1": K562 cell line, strong SV40 promoter, replicate 1
- "k562sv40prep2": K562 cell line, strong SV40 promoter, replicate 2
- "k562sv40pavg": K562 cell line, strong SV40 promoter, average*
- "hepg2minprep1": HepG2 cell line, minimal promoter, replicate 1
- "hepg2minprep2": HepG2 cell line, minimal promoter, replicate 2
- "hepg2minpavg": HepG2 cell line, minimal promoter, average*
- "hepg2sv40prep1": HepG2 cell line, strong SV40 promoter, replicate 1
- "hepg2sv40prep2": HepG2 cell line, strong SV40 promoter, replicate 2
- "hepg2sv40pavg": HepG2 cell line, strong SV40 promoter, average*
*The "average" tasks are computed by pooling counts between replicates, i.e., computing log2(RNARep1 + RNARep2 + 1) - log2(DNA + 1).
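A small worked example of the pooled "average" value, assuming base 2 for both logarithms and a single DNA count as written in the footnote. The function name and the counts are made up for illustration.

```python
import math

def pooled_log2_ratio(rna_rep1, rna_rep2, dna):
    """log2(RNARep1 + RNARep2 + 1) - log2(DNA + 1), pooling RNA counts across replicates."""
    return math.log2(rna_rep1 + rna_rep2 + 1) - math.log2(dna + 1)

# Illustrative counts: log2(10 + 21 + 1) - log2(7 + 1) = log2(32) - log2(8) = 5 - 3
value = pooled_log2_ratio(10, 21, 7)  # 2.0
```

Note that this pools the raw counts before taking the logarithm, which is not the same as averaging the two replicates' log ratios.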
Model Training
The inputs to our model are shape (145, 4) NumPy arrays corresponding to one-hot encoded 145 base-pair DNA sequences. The outputs are 12 continuous values corresponding to normalized activity levels of the sequence in different cellular contexts (described above).
The neural network model used for MPRA activity prediction is a fairly standard convolutional architecture for genomics. We use three convolution layers (ReLU activation), each containing 120 filters of width 5, followed by a single fully connected layer (linear activation) to predict the 12 tasks. Our model uses task-wise mean squared error loss, and our primary evaluation criterion (for validation/testing) is the Spearman correlation, which, unlike Pearson, is robust to outliers.
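To illustrate why Spearman correlation is less sensitive to outliers than Pearson, here is a dependency-free sketch that computes Spearman as the Pearson correlation of the ranks. It is not the evaluation code from this repository, and the helper names are hypothetical; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# One extreme prediction lowers Pearson but leaves the rank order intact.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 1000.0]
rho = spearman(observed, predicted)
r = pearson(observed, predicted)
```

In this toy case the predictions are perfectly monotonic in the observations, so the Spearman value is 1 (up to floating-point error) while the Pearson value is pulled below 1 by the single large prediction.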
The models are implemented in Keras with a TensorFlow backend. To train the model:
```bash
python main.py --data_path /path/to/data
```
To resume training from an existing checkpoint:
```bash
python main.py --data_path /path/to/data --pretrained_model_checkpoint /path/to/checkpoint/model
```
During training, the model writes logs to the experiments directory, which can be visualized with TensorBoard:
```bash
tensorboard --logdir /path/to/log/dir/in/experiments
```
To evaluate on the test set, pass the --evaluate 1 flag in addition to resuming the model from the checkpoint.
For other inputs, such as hyperparameters, refer to the help output:
```bash
python main.py --help
```
Prediction and Interpretation
We provide pretrained models in the kipoi/ConvModel and kipoi/DeepFactorizedModel directories, with support for prediction and interpretation through Kipoi. The model json and yaml files are also available in the kipoi directory. The models can be loaded as:
```python
import kipoi

model = kipoi.get_model("SNPpet/ConvModel")  # or "SNPpet/DeepFactorizedModel"
```
Follow the instructions on Kipoi to make predictions on arbitrary sequences and for interpreting the model.
Help
Feel free to direct questions about this project to Rajiv Movva: rmovva at mit dot edu, or open an Issue.
Citation
If you use this code for your research, please cite our paper: Movva R, Greenside P, Marinov GK, Nair S, Shrikumar A, Kundaje A (2019). Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS ONE 14(6): e0218073. https://doi.org/10.1371/journal.pone.0218073
Owner
- Name: Kundaje Lab
- Login: kundajelab
- Kind: organization
- Location: Stanford University
- Website: http://anshul.kundaje.net
- Repositories: 117
- Profile: https://github.com/kundajelab
Compbio and machine learning code repositories from the Kundaje Lab at Stanford Genetics and Computer Science Depts.
GitHub Events
Total
- Watch event: 1
- Fork event: 1
Last Year
- Watch event: 1
- Fork event: 1
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| suragnair | s****r@h****m | 25 |
| Rajiv Movva | t****v@g****m | 5 |
| Surag Nair | s****r | 5 |