https://github.com/kundajelab/bpnet-refactor
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ✓ Academic publication links (links to: nature.com)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 15.1%, to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: kundajelab
- License: mit
- Language: Python
- Default Branch: main
- Size: 6.74 MB
Statistics
- Stars: 6
- Watchers: 5
- Forks: 3
- Open Issues: 6
- Releases: 0
Metadata Files
README.md
BPNet
BPNet is a python package with a CLI to train and interpret base-resolution deep neural networks trained on functional genomics data such as ChIP-nexus or ChIP-seq. This is the improved version of the repo associated with the BPNet paper. We suggest using this repo going forward instead of the original.
Table of contents
- Installation
- Tutorial
- How to Cite
- Full Documentation
Installation
This section describes the packages needed to train and interpret BPNet models. First, it is recommended that you use a GPU for model training, with the necessary NVIDIA drivers and CUDA already installed. You can verify that your machine is set up to use GPUs by executing the nvidia-smi command and checking that it returns information about your system GPU(s) rather than an error. Second, there are three ways to get the necessary packages, which we detail below:
Instead of installing the BPNet repo yourself, you can also use the Docker or AnVIL options to train and use BPNet models.
1. Docker
Download and install the latest version of Docker for your platform. Here is the link to the installers: Docker Installers. Run the docker run command below to open an environment with all the packages installed.
```
docker pull vivekramalingam/tf-atlas:gcp-modeling_v2.1.0-rc.1

docker run -it --rm --cpus=10 --memory=32g --gpus device=1 \
    --mount src=/mnt/bpnet-models/,target=/mydata,type=bind \
    vivekramalingam/tf-atlas:gcp-modeling_v2.1.0-rc.1
```
2. Anvil
The NHGRI's AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space) platform allows researchers with minimal computational skills to run analysis tools with just a few mouse clicks. Using our BPNet workflow on the AnVIL platform, you can train state-of-the-art deep learning BPNet models for ChIP-seq data without any programming experience. We highly recommend this option for biologists. It is also highly scalable, since GPU provisioning on the cloud is handled automatically by AnVIL. We trained our atlas-scale models on the entire ENCODE compendium using the AnVIL platform.
It is also available for training models for chromatin accessibility experiments.
A video tutorial for training BPNet models on AnVIL is available here:
***TODO - Anvil Tutorial
3. Local Installation
3.1. Install Miniconda
Download and install the latest version of Miniconda for your platform. Here is the link for the installers - Miniconda Installers
3.2. Create new virtual environment
Create a new virtual environment and activate it as shown below:
```
conda create --name bpnet python=3.7
conda activate bpnet
```
3.3. Install basepairmodels
```
pip install git+https://github.com/kundajelab/bpnet-refactor.git
```
3.4. Additional tools to install
For the following steps you will need samtools, bamtools, and bedGraphToBigWig, which are not installed as part of this repository. The tools can be installed via the links below or using conda.
bedGraphToBigWig (Linux 64-bit)
bedGraphToBigWig (Mac OSX 10.14.6)
```
conda install -y -c bioconda samtools=1.1 bedtools ucsc-bedgraphtobigwig
```
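As a quick sanity check (not part of the repository), a short Python snippet using the standard library's `shutil.which` can confirm these tools are on your `PATH` before you start preprocessing:

```python
import shutil

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Tools required by the preprocessing steps in this README
required = ["samtools", "bamtools", "bedGraphToBigWig"]
for tool in missing_tools(required):
    print(f"missing: {tool} -- install it via conda or the links above")
```

If nothing is printed, all three tools are installed and visible to your shell.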
Tutorial
0. Optional: an additional walkthrough for downloading and preprocessing example data:
1. Train a model!
Before we start training, we need to compile a json file that contains information about the input data. We will call this file input_data.json. Here is a sample json file that shows how to specify the input data. The data is organized into tasks and tracks. In this example we have one task and two tracks, the plus and the minus strand. Each task has up to four keys (two of them optional), with values corresponding to tracks computed in the preprocessing steps:
- signal: the plus and minus bigwig tracks
- loci: the bed file with the filtered peaks
- background_loci (optional): the bed file with GC-matched background regions (source), and the ratio of positive-to-negative regions used when generating the GC-matched negatives file, expressed as a decimal (ratio)
- bias (optional): the plus and minus control bigwig tracks
Note that the input_data.json file is used for multiple downstream steps.
```
{
    "0": {
        "signal": {
            "source": ["ENCSR000EGM/data/plus.bw",
                       "ENCSR000EGM/data/minus.bw"]
        },
        "loci": {
            "source": ["ENCSR000EGM/data/peaks_inliers.bed"]
        },
        "background_loci": {
            "source": ["ENCSR000EGM/data/gc_negatives.bed"],
            "ratio": [0.33]
        },
        "bias": {
            "source": ["ENCSR000EGM/data/control_plus.bw",
                       "ENCSR000EGM/data/control_minus.bw"],
            "smoothing": [null, null]
        }
    }
}
```
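If you prefer to generate this file programmatically, the same structure can be written from Python (the paths are the example ENCSR000EGM ones from the sample above):

```python
import json

# Mirrors the sample input_data.json: one task ("0") with plus/minus
# signal tracks, peak loci, optional GC-matched background regions,
# and optional control (bias) tracks.
input_data = {
    "0": {
        "signal": {"source": ["ENCSR000EGM/data/plus.bw",
                              "ENCSR000EGM/data/minus.bw"]},
        "loci": {"source": ["ENCSR000EGM/data/peaks_inliers.bed"]},
        "background_loci": {"source": ["ENCSR000EGM/data/gc_negatives.bed"],
                            "ratio": [0.33]},
        "bias": {"source": ["ENCSR000EGM/data/control_plus.bw",
                            "ENCSR000EGM/data/control_minus.bw"],
                 "smoothing": [None, None]},  # Python None serializes to JSON null
    }
}

with open("input_data.json", "w") as f:
    json.dump(input_data, f, indent=4)
```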
Additionally, we need a json file to specify parameters for the BPNet architecture. Let's call this json file
bpnet_params.json
```
{
    "input_len": 2114,
    "output_profile_len": 1000,
    "motif_module_params": {
        "filters": [64],
        "kernel_sizes": [21],
        "padding": "valid"
    },
    "syntax_module_params": {
        "num_dilation_layers": 8,
        "filters": 64,
        "kernel_size": 3,
        "padding": "valid",
        "pre_activation_residual_unit": true
    },
    "profile_head_params": {
        "filters": 1,
        "kernel_size": 75,
        "padding": "valid"
    },
    "counts_head_params": {
        "units": [1],
        "dropouts": [0.0],
        "activations": ["linear"]
    },
    "profile_bias_module_params": {
        "kernel_sizes": [1]
    },
    "counts_bias_module_params": {
    },
    "use_attribution_prior": false,
    "attribution_prior_params": {
        "frequency_limit": 150,
        "limit_softness": 0.2,
        "grad_smooth_sigma": 3,
        "profile_grad_loss_weight": 200,
        "counts_grad_loss_weight": 100
    },
    "loss_weights": [1, 42],
    "counts_loss": "MSE"
}
```
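As a sanity check on the two lengths in this file: with `valid` padding, each convolution trims `(kernel_size - 1) * dilation` positions from the sequence. Assuming the dilation rate doubles at each of the 8 syntax-module layers (2, 4, ..., 256, as in the original BPNet architecture; this schedule is an assumption, not stated in the json), the numbers above work out exactly:

```python
# Each 'valid' convolution shrinks the sequence by (kernel_size - 1) * dilation.
input_len = 2114
motif_trim = 21 - 1                                          # motif conv, kernel 21
dilation_trim = sum((3 - 1) * 2 ** i for i in range(1, 9))   # 8 dilated layers, kernel 3
profile_head_trim = 75 - 1                                   # profile head, kernel 75

output_len = input_len - motif_trim - dilation_trim - profile_head_trim
print(output_len)  # 2114 - 20 - 1020 - 74 = 1000, matching output_profile_len
```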
The loss_weights field has two values: the profile loss weight and the counts loss weight. The profile loss weight is always set to 1; an optimal counts loss weight can be generated automatically using the following command:
```
bpnet-counts-loss-weight --input-data input_data.json
```
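To see why this weight matters (toy numbers only; this is not the heuristic the CLI actually implements): the multinomial profile loss and the MSE counts loss live on very different scales, and the total training loss weights the counts term so that neither objective dominates:

```python
# Toy illustration of the combined BPNet loss:
#   total = profile_weight * profile_loss + counts_weight * counts_loss
profile_loss = 420.0   # multinomial NLL, scales with read depth per region
counts_loss = 10.0     # MSE on log total counts, typically much smaller
profile_weight = 1     # always 1, per the loss_weights field above
counts_weight = 42     # the value bpnet-counts-loss-weight would supply

total = profile_weight * profile_loss + counts_weight * counts_loss
print(total)  # 420 + 42 * 10 = 840: both terms now contribute comparably
```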
Finally, we will make the splits.json file, which specifies the chromosomes used for training, validation, and testing. Here is a sample that contains one split.
```
{
    "0": {
        "test":
            ["chr7", "chr13", "chr17", "chr19", "chr21", "chrX"],
        "val":
            ["chr10", "chr18"],
        "train":
            ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr8", "chr9", "chr11", "chr12", "chr14", "chr15", "chr16", "chr20", "chr22", "chrY"]
    }
}
```
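Because any chromosome shared between splits silently leaks test data into training, a small check like the following (illustrative; not part of the BPNet CLI) is worth running on your splits.json:

```python
import json
from itertools import combinations

def check_splits(path):
    """Raise ValueError if any chromosome appears in more than one of
    train/val/test within a split in a splits.json file."""
    with open(path) as f:
        splits = json.load(f)
    for split_id, split in splits.items():
        for a, b in combinations(["train", "val", "test"], 2):
            overlap = set(split.get(a, [])) & set(split.get(b, []))
            if overlap:
                raise ValueError(
                    f"split {split_id}: {sorted(overlap)} in both {a} and {b}")
```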
Now that we have our data prepped, we can train our first model!
The command to train a model is bpnet-train. You can kick-start the training process by executing this command in your shell:
```
BASE_DIR=ENCSR000EGM
DATA_DIR=$BASE_DIR/data
MODEL_DIR=$BASE_DIR/models
REFERENCE_DIR=reference
CHROM_SIZES=$REFERENCE_DIR/hg38.chrom.sizes
REFERENCE_GENOME=$REFERENCE_DIR/hg38.genome.fa
CV_SPLITS=$BASE_DIR/splits.json
INPUT_DATA=$BASE_DIR/input_data.json
MODEL_PARAMS=$BASE_DIR/bpnet_params.json

mkdir $MODEL_DIR

bpnet-train \
    --input-data $INPUT_DATA \
    --output-dir $MODEL_DIR \
    --reference-genome $REFERENCE_GENOME \
    --chroms $(paste -s -d ' ' $REFERENCE_DIR/chroms.txt) \
    --chrom-sizes $CHROM_SIZES \
    --splits $CV_SPLITS \
    --model-arch-name BPNet \
    --model-arch-params-json $MODEL_PARAMS \
    --sequence-generator-name BPNet \
    --model-output-filename model \
    --input-seq-len 2114 \
    --output-len 1000 \
    --shuffle \
    --threads 10 \
    --epochs 100 \
    --batch-size 64 \
    --reverse-complement-augmentation \
    --early-stopping-patience 10 \
    --reduce-lr-on-plateau-patience 5 \
    --learning-rate 0.001
```
Note: It might take a few minutes for training to begin once the above command has been issued; be patient and you should eventually see training start.
2. Model Prediction
Once the training is complete we can generate predictions on the test set chromosomes to understand how well the model performs on regions not seen during training.
```
PREDICTIONS_DIR=$BASE_DIR/predictions_and_metrics
mkdir $PREDICTIONS_DIR

bpnet-predict \
    --model $MODEL_DIR/model_split000 \
    --chrom-sizes $CHROM_SIZES \
    --chroms chr7 chr13 chr17 chr19 chr21 chrX \
    --test-indices-file None \
    --reference-genome $REFERENCE_GENOME \
    --output-dir $PREDICTIONS_DIR \
    --input-data $INPUT_DATA \
    --sequence-generator-name BPNet \
    --input-seq-len 2114 \
    --output-len 1000 \
    --output-window-size 1000 \
    --batch-size 64 \
    --reverse-complement-average \
    --threads 2 \
    --generate-predicted-profile-bigWigs
```
This script will output test metrics, and will also output bigwig tracks if --generate-predicted-profile-bigWigs is specified.
It is useful to also make predictions on all the peaks, as the model predictions provide a denoised version of the observed signal.
```
bpnet-predict \
    --model $MODEL_DIR/model_split000 \
    --chrom-sizes $CHROM_SIZES \
    --chroms chr1 chr2 chrX chr3 chr4 chr5 chr6 chr7 chr10 chr8 chr14 chr9 chr11 chr13 chr12 chr15 chr16 chr17 chrY chr18 chr19 chrM \
    --test-indices-file None \
    --reference-genome $REFERENCE_GENOME \
    --output-dir $PREDICTIONS_DIR \
    --input-data $INPUT_DATA \
    --sequence-generator-name BPNet \
    --input-seq-len 2114 \
    --output-len 1000 \
    --output-window-size 1000 \
    --batch-size 64 \
    --reverse-complement-average \
    --threads 2 \
    --generate-predicted-profile-bigWigs
```
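The test metrics reported at this step include measures of how closely the predicted profile shape matches the observed one. As a rough illustration of one such measure (illustrative only; the exact metrics and normalizations computed by the CLI may differ), the Jensen-Shannon divergence between an observed and a predicted base-resolution profile can be computed with numpy:

```python
import numpy as np

def jsd(obs, pred, eps=1e-10):
    """Jensen-Shannon divergence (base 2) between two profiles,
    each normalized to a probability distribution over positions."""
    p = np.asarray(obs, dtype=float) + eps
    q = np.asarray(pred, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy 5-bp profiles: 0.0 means identical shape, 1.0 means disjoint support.
observed = [0, 3, 10, 4, 1]
predicted = [1, 2, 9, 5, 1]
print(jsd(observed, predicted))
```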
3. Compute importance scores
"Understanding why a model makes a certain prediction can be as crucial as the prediction’s accuracy in many applications". SHAP (SHapley Additive exPlanations) assigns each feature an importance value for a particular prediction. In our case we will assign each nucleotide an importance value or how much it contributed for the binding at each input region according to our model. A model that can predict the binding signal well at the test regions presumably must have learnt usefull features and thus the importance score values can shed light in to the driving sequence feature including the Transcription factor binding motifs.
```
SHAP_DIR=$BASE_DIR/shap
mkdir $SHAP_DIR

bpnet-shap \
    --reference-genome $REFERENCE_GENOME \
    --model $MODEL_DIR/model_split000 \
    --bed-file $DATA_DIR/peaks_inliers.bed \
    --output-dir $SHAP_DIR \
    --input-seq-len 2114 \
    --control-len 1000 \
    --task-id 0 \
    --input-data $INPUT_DATA \
    --chrom-sizes $CHROM_SIZES \
    --generate-shap-bigWigs

# You can also provide, for example, --chroms chr1 to generate shap scores
# only on a subset of chromosomes (chr1)
```
This script will output SHAP bigwig tracks, which are useful for visualizing in a genome browser the bases/motifs that are important for binding, if --generate-shap-bigWigs is specified.
The UCSC and WashU browsers currently support visualizing the importance scores; use the Dynseq option.
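A common post-processing step in this ecosystem (sketched here with toy numbers, not taken from the BPNet code): per-base importance tracks are obtained by multiplying the model's hypothetical contribution scores with the one-hot encoded sequence, keeping only the score of the base actually present at each position:

```python
import numpy as np

# Hypothetical contribution scores for a toy 4-bp sequence "ACGT":
# shape (length, 4) over the alphabet A, C, G, T.
hyp_scores = np.array([[0.5, -0.1,  0.0, 0.2],
                       [0.1,  0.8, -0.2, 0.0],
                       [0.0, -0.3,  0.6, 0.1],
                       [0.2,  0.0,  0.1, 0.9]])

# One-hot encoding of "ACGT": position i holds the i-th letter,
# so the encoding is the 4x4 identity matrix.
one_hot = np.eye(4)

# Zero out contributions of absent bases, then sum over the alphabet axis
# to get the per-base (actual) importance track.
actual = (hyp_scores * one_hot).sum(axis=1)
print(actual)  # the scores of the present bases: 0.5, 0.8, 0.6, 0.9
```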
Please cite:
The dynseq browser track shows context-specific features at nucleotide resolution
S Nair, A Barrett, D Li, BJ Raney, BT Lee, P Kerpedjiev, V Ramalingam, ... Nature genetics 54 (11), 1581-1583
4. Discover motifs with TF-modisco
"TF-MoDISco is a biological motif discovery algorithm that differentiates itself by using attribution scores from a machine learning model, in addition to the sequence itself, to guide motif discovery. Using the attribution scores, as opposed to just the sequence, can be beneficial because the attributions fine-map the specific sequence drivers of biology. Although in many of our examples this model is BPNet and the attributions are from DeepLIFT/DeepSHAP, there is no limit on what attribution algorithm is used, or what model the attributions come from."
This step allows one to get a list of sequence motifs that drive the signal in the studied experiment.
Use the newer version of TF-MoDISco, called modisco-lite, to discover the motifs. Support for using modisco-lite directly from the BPNet repo will be added later. For now, use: https://github.com/jmschrei/tfmodisco-lite
```
# for example
conda create --name tfmodisco python=3.10
conda activate tfmodisco
pip install modisco-lite

MODISCO_DIR=$BASE_DIR/modisco
MODISCO_COUNTS_DIR=$BASE_DIR/modisco/counts
MODISCO_PROFILE_DIR=$BASE_DIR/modisco/profile
mkdir -p $MODISCO_COUNTS_DIR
mkdir -p $MODISCO_PROFILE_DIR

modisco motifs \
    --max_seqlets 50000 \
    --h5py $SHAP_DIR/profile_scores.h5 \
    -o $MODISCO_PROFILE_DIR/profile_modisco_scores.h5 \
    --trim_size 20 \
    --initial_flank_to_add 5 \
    --final_flank_to_add 10

modisco motifs \
    --max_seqlets 50000 \
    --h5py $SHAP_DIR/counts_scores.h5 \
    -o $MODISCO_COUNTS_DIR/counts_modisco_scores.h5 \
    --trim_size 20 \
    --initial_flank_to_add 5 \
    --final_flank_to_add 10
```
5. Discover location of the identified motifs with FiNeMo
While TF-MoDISco produces a list of sequence motifs, FiNeMo maps the locations of these motifs across all the input regions. FiNeMo is a GPU-accelerated hit caller for retrieving TF-MoDISco motif occurrences from machine-learning-model-generated contribution scores.
Support for using FiNeMo directly from the BPNet repo will be added later. For now, use: https://github.com/austintwang/finemo_gpu
```
# for example
git clone https://github.com/austintwang/finemo_gpu.git
cd finemo_gpu

conda env create -f environment.yml -n finemo
conda activate finemo
pip install --editable .

cd ..

HITS_DIR_COUNTS=$BASE_DIR/hits/counts
mkdir -p $HITS_DIR_COUNTS

finemo extract-regions-bpnet-h5 -c $SHAP_DIR/counts_scores.h5 -o ${HITS_DIR_COUNTS}/regions_bw.npz -w 400 -p $BASE_DIR/data/peaks_inliers.bed

finemo call-hits -l ${lambda} -r ${HITS_DIR_COUNTS}/regions_bw.npz -m ${modisco_h5} -p $BASE_DIR/data/peaks_inliers.bed -C $CHROM_SIZES -o ${HITS_DIR_COUNTS} -t 0.7 --compile

finemo report -H ${HITS_DIR_COUNTS}/hits.tsv -r ${HITS_DIR_COUNTS}/regions_bw.npz -m ${modisco_h5} -p $BASE_DIR/data/peaks_inliers.bed -o ${HITS_DIR_COUNTS}/ -t 0.7 -W 400
```
How to Cite
If you're using BPNet in your work, please cite the original BPNet paper:
Avsec, Ž., Weilert, M., Shrikumar, A. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 53, 354–366 (2021).
And reach out to Vivekanandan Ramalingam and Anshul Kundaje to discuss how to cite this updated version. A preprint associated with this new version will be released soon and will be linked here.
Owner
- Name: Kundaje Lab
- Login: kundajelab
- Kind: organization
- Location: Stanford University
- Website: http://anshul.kundaje.net
- Repositories: 117
- Profile: https://github.com/kundajelab
Compbio and machine learning code repositories from the Kundaje Lab at Stanford Genetics and Computer Science Depts.
GitHub Events
Total
- Issues event: 14
- Watch event: 4
- Delete event: 1
- Issue comment event: 7
- Push event: 15
- Pull request review event: 2
- Pull request event: 2
- Gollum event: 4
- Fork event: 4
- Create event: 4
Last Year
- Issues event: 14
- Watch event: 4
- Delete event: 1
- Issue comment event: 7
- Push event: 15
- Pull request review event: 2
- Pull request event: 2
- Gollum event: 4
- Fork event: 4
- Create event: 4
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 10
- Total pull requests: 1
- Average time to close issues: 6 months
- Average time to close pull requests: 3 days
- Total issue authors: 7
- Total pull request authors: 1
- Average comments per issue: 0.9
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 7
- Pull requests: 1
- Average time to close issues: 16 days
- Average time to close pull requests: 3 days
- Issue authors: 6
- Pull request authors: 1
- Average comments per issue: 1.14
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- vhecht (6)
- manoloff (2)
- G-Chapman (1)
- patrickgohl (1)
- pengjk689 (1)
- jamesgalante (1)
- arodel21 (1)
Pull Request Authors
- vhecht (2)
- kmualim (1)