https://github.com/kundajelab/bpnet-refactor
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ✓ Academic publication links (links to: nature.com)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 15.1%, to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: kundajelab
- License: mit
- Language: Python
- Default Branch: main
- Size: 6.74 MB
Statistics
- Stars: 6
- Watchers: 5
- Forks: 3
- Open Issues: 6
- Releases: 0
Metadata Files
README.md
BPNet
BPNet is a python package with a CLI to train and interpret base-resolution deep neural networks trained on functional genomics data such as ChIP-nexus or ChIP-seq. This is the improved version of the repo associated with the BPNet paper. We suggest using this repo going forward instead of the original.
Table of contents
- Installation
- Tutorial
- How to Cite
- Full Documentation
Installation
This section describes the packages needed to train and interpret BPNet models. First, it is recommended that you use a GPU for model training, with the necessary NVIDIA drivers and CUDA already installed. You can verify that your machine is set up to use GPUs by executing the nvidia-smi command and checking that it returns information about your system GPU(s) rather than an error. Second, there are three ways to get the necessary packages, which we detail below:
Instead of installing the BPNet repo yourself, you can also use the Docker or AnVIL options to train and use BPNet models.
1. Docker
Download and install the latest version of Docker for your platform. Here is the link to the installers: Docker Installers. Run the docker run command below to open an environment with all the packages installed.
```
docker pull vivekramalingam/tf-atlas:gcp-modeling_v2.1.0-rc.1

docker run -it --rm --cpus=10 --memory=32g --gpus device=1 \
    --mount src=/mnt/bpnet-models/,target=/mydata,type=bind \
    vivekramalingam/tf-atlas:gcp-modeling_v2.1.0-rc.1
```
2. Anvil
The NHGRI's AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space) platform allows researchers with minimal computational skills to run analysis tools with just a few mouse clicks. Using our BPNet workflow on the AnVIL platform, you can train state-of-the-art deep learning BPNet models for ChIP-seq data without any programming experience. We highly recommend this option for biologists. It is also highly scalable, since GPU provisioning on the cloud is handled automatically by AnVIL. We trained our atlas-scale models on the entire ENCODE compendium using the AnVIL platform.
It is also available for training models for chromatin accessibility experiments.
A video tutorial for training BPNet models on AnVIL is available here:
***TODO - Anvil Tutorial
3. Local Installation
3.1. Install Miniconda
Download and install the latest version of Miniconda for your platform. Here is the link for the installers - Miniconda Installers
3.2. Create new virtual environment
Create a new virtual environment and activate it as shown below:
```
conda create --name bpnet python=3.7
conda activate bpnet
```
3.3. Install basepairmodels
```
pip install git+https://github.com/kundajelab/bpnet-refactor.git
```
3.4. Additional tools to install
For the following steps you will need samtools, bamtools, and bedGraphToBigWig, which are not installed as part of this repository. The tools can be installed via the links below or using conda.
bedGraphToBigWig (Linux 64-bit)
bedGraphToBigWig (Mac OSX 10.14.6)
```
conda install -y -c bioconda samtools=1.1 bedtools ucsc-bedgraphtobigwig
```
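As a quick sanity check (not part of the repository), a short Python snippet using the standard library's `shutil.which` can confirm these tools are on your `PATH` before you start preprocessing:

```python
import shutil

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Tools required by the preprocessing steps in this README
required = ["samtools", "bamtools", "bedGraphToBigWig"]
for tool in missing_tools(required):
    print(f"missing: {tool} -- install it via conda or the links above")
```

If nothing is printed, all three tools are installed and visible to your shell.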
Tutorial
0. Optional: an additional walkthrough for downloading and preprocessing example data:
1. Train a model!
Before we start training, we need to compile a json file that contains information about the input data. We will call this file input_data.json. Here is a sample json file that shows how to specify the input data. The data is organized into tasks and tracks. In this example we have one task and two tracks, the plus and the minus strand. Each task has up to four keys (two of them optional), with values corresponding to tracks computed in the preprocessing steps:
- signal: the plus and minus bigwig tracks
- loci: the bed file with the filtered peaks
- background_loci (optional): the bed file with GC-matched background regions (source), and the ratio of positive-to-negative regions used when generating the GC-matched negatives file, expressed as a decimal (ratio)
- bias (optional): the plus and minus control bigwig tracks
Note that the input_data.json file is used for multiple downstream steps.
```
{
    "0": {
        "signal": {
            "source": ["ENCSR000EGM/data/plus.bw",
                       "ENCSR000EGM/data/minus.bw"]
        },
        "loci": {
            "source": ["ENCSR000EGM/data/peaks_inliers.bed"]
        },
        "background_loci": {
            "source": ["ENCSR000EGM/data/gc_negatives.bed"],
            "ratio": [0.33]
        },
        "bias": {
            "source": ["ENCSR000EGM/data/control_plus.bw",
                       "ENCSR000EGM/data/control_minus.bw"],
            "smoothing": [null, null]
        }
    }
}
```
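If you prefer to generate this file programmatically, the same structure can be written from Python (the paths are the example ENCSR000EGM ones from the sample above):

```python
import json

# Mirrors the sample input_data.json: one task ("0") with plus/minus
# signal tracks, peak loci, optional GC-matched background regions,
# and optional control (bias) tracks.
input_data = {
    "0": {
        "signal": {"source": ["ENCSR000EGM/data/plus.bw",
                              "ENCSR000EGM/data/minus.bw"]},
        "loci": {"source": ["ENCSR000EGM/data/peaks_inliers.bed"]},
        "background_loci": {"source": ["ENCSR000EGM/data/gc_negatives.bed"],
                            "ratio": [0.33]},
        "bias": {"source": ["ENCSR000EGM/data/control_plus.bw",
                            "ENCSR000EGM/data/control_minus.bw"],
                 "smoothing": [None, None]},  # Python None serializes to JSON null
    }
}

with open("input_data.json", "w") as f:
    json.dump(input_data, f, indent=4)
```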
Additionally, we need a json file to specify parameters for the BPNet architecture. Let's call this json file
bpnet_params.json
```
{
    "input_len": 2114,
    "output_profile_len": 1000,
    "motif_module_params": {
        "filters": [64],
        "kernel_sizes": [21],
        "padding": "valid"
    },
    "syntax_module_params": {
        "num_dilation_layers": 8,
        "filters": 64,
        "kernel_size": 3,
        "padding": "valid",
        "pre_activation_residual_unit": true
    },
    "profile_head_params": {
        "filters": 1,
        "kernel_size": 75,
        "padding": "valid"
    },
    "counts_head_params": {
        "units": [1],
        "dropouts": [0.0],
        "activations": ["linear"]
    },
    "profile_bias_module_params": {
        "kernel_sizes": [1]
    },
    "counts_bias_module_params": {
    },
    "use_attribution_prior": false,
    "attribution_prior_params": {
        "frequency_limit": 150,
        "limit_softness": 0.2,
        "grad_smooth_sigma": 3,
        "profile_grad_loss_weight": 200,
        "counts_grad_loss_weight": 100
    },
    "loss_weights": [1, 42],
    "counts_loss": "MSE"
}
```
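As a sanity check on the two lengths in this file: with `valid` padding, each convolution trims `(kernel_size - 1) * dilation` positions from the sequence. Assuming the dilation rate doubles at each of the 8 syntax-module layers (2, 4, ..., 256, as in the original BPNet architecture; this schedule is an assumption, not stated in the json), the numbers above work out exactly:

```python
# Each 'valid' convolution shrinks the sequence by (kernel_size - 1) * dilation.
input_len = 2114
motif_trim = 21 - 1                                          # motif conv, kernel 21
dilation_trim = sum((3 - 1) * 2 ** i for i in range(1, 9))   # 8 dilated layers, kernel 3
profile_head_trim = 75 - 1                                   # profile head, kernel 75

output_len = input_len - motif_trim - dilation_trim - profile_head_trim
print(output_len)  # 2114 - 20 - 1020 - 74 = 1000, matching output_profile_len
```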
The loss_weights field has two values: the profile loss weight and the counts loss weight. The profile loss weight is always set to 1; an optimal counts loss weight can be generated automatically using the following command:
```
bpnet-counts-loss-weight --input-data input_data.json
```
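To see why this weight matters (toy numbers only; this is not the heuristic the CLI actually implements): the multinomial profile loss and the MSE counts loss live on very different scales, and the total training loss weights the counts term so that neither objective dominates:

```python
# Toy illustration of the combined BPNet loss:
#   total = profile_weight * profile_loss + counts_weight * counts_loss
profile_loss = 420.0   # multinomial NLL, scales with read depth per region
counts_loss = 10.0     # MSE on log total counts, typically much smaller
profile_weight = 1     # always 1, per the loss_weights field above
counts_weight = 42     # the value bpnet-counts-loss-weight would supply

total = profile_weight * profile_loss + counts_weight * counts_loss
print(total)  # 420 + 42 * 10 = 840: both terms now contribute comparably
```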
Finally, we will make the splits.json file, which specifies the chromosomes used for training, validation, and testing. Here is a sample that contains one split.
```
{
    "0": {
        "test":
            ["chr7", "chr13", "chr17", "chr19", "chr21", "chrX"],
        "val":
            ["chr10", "chr18"],
        "train":
            ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr8", "chr9", "chr11", "chr12", "chr14", "chr15", "chr16", "chr20", "chr22", "chrY"]
    }
}
```
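Because any chromosome shared between splits silently leaks test data into training, a small check like the following (illustrative; not part of the BPNet CLI) is worth running on your splits.json:

```python
import json
from itertools import combinations

def check_splits(path):
    """Raise ValueError if any chromosome appears in more than one of
    train/val/test within a split in a splits.json file."""
    with open(path) as f:
        splits = json.load(f)
    for split_id, split in splits.items():
        for a, b in combinations(["train", "val", "test"], 2):
            overlap = set(split.get(a, [])) & set(split.get(b, []))
            if overlap:
                raise ValueError(
                    f"split {split_id}: {sorted(overlap)} in both {a} and {b}")
```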
Now that we have our data prepped, we can train our first model!
The command to train a model is bpnet-train. You can kick-start the training process by executing this command in your shell:
```
BASE_DIR=ENCSR000EGM
DATA_DIR=$BASE_DIR/data
MODEL_DIR=$BASE_DIR/models
REFERENCE_DIR=reference
CHROM_SIZES=$REFERENCE_DIR/hg38.chrom.sizes
REFERENCE_GENOME=$REFERENCE_DIR/hg38.genome.fa
CV_SPLITS=$BASE_DIR/splits.json
INPUT_DATA=$BASE_DIR/input_data.json
MODEL_PARAMS=$BASE_DIR/bpnet_params.json

mkdir $MODEL_DIR

bpnet-train \
    --input-data $INPUT_DATA \
    --output-dir $MODEL_DIR \
    --reference-genome $REFERENCE_GENOME \
    --chroms $(paste -s -d ' ' $REFERENCE_DIR/chroms.txt) \
    --chrom-sizes $CHROM_SIZES \
    --splits $CV_SPLITS \
    --model-arch-name BPNet \
    --model-arch-params-json $MODEL_PARAMS \
    --sequence-generator-name BPNet \
    --model-output-filename model \
    --input-seq-len 2114 \
    --output-len 1000 \
    --shuffle \
    --threads 10 \
    --epochs 100 \
    --batch-size 64 \
    --reverse-complement-augmentation \
    --early-stopping-patience 10 \
    --reduce-lr-on-plateau-patience 5 \
    --learning-rate 0.001
```
Note: It might take a few minutes for training to begin once the above command has been issued; be patient and you should eventually see training start.
2. Model Prediction
Once the training is complete we can generate predictions on the test set chromosomes to understand how well the model performs on regions not seen during training.
```
PREDICTIONS_DIR=$BASE_DIR/predictions_and_metrics
mkdir $PREDICTIONS_DIR

bpnet-predict \
    --model $MODEL_DIR/model_split000 \
    --chrom-sizes $CHROM_SIZES \
    --chroms chr7 chr13 chr17 chr19 chr21 chrX \
    --test-indices-file None \
    --reference-genome $REFERENCE_GENOME \
    --output-dir $PREDICTIONS_DIR \
    --input-data $INPUT_DATA \
    --sequence-generator-name BPNet \
    --input-seq-len 2114 \
    --output-len 1000 \
    --output-window-size 1000 \
    --batch-size 64 \
    --reverse-complement-average \
    --threads 2 \
    --generate-predicted-profile-bigWigs
```
This script will output test metrics, and will also output bigwig tracks if --generate-predicted-profile-bigWigs is specified.
It is useful to also make predictions on all the peaks, as the model predictions provide a denoised version of the observed signal.
```
bpnet-predict \
    --model $MODEL_DIR/model_split000 \
    --chrom-sizes $CHROM_SIZES \
    --chroms chr1 chr2 chrX chr3 chr4 chr5 chr6 chr7 chr10 chr8 chr14 chr9 chr11 chr13 chr12 chr15 chr16 chr17 chrY chr18 chr19 chrM \
    --test-indices-file None \
    --reference-genome $REFERENCE_GENOME \
    --output-dir $PREDICTIONS_DIR \
    --input-data $INPUT_DATA \
    --sequence-generator-name BPNet \
    --input-seq-len 2114 \
    --output-len 1000 \
    --output-window-size 1000 \
    --batch-size 64 \
    --reverse-complement-average \
    --threads 2 \
    --generate-predicted-profile-bigWigs
```
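The test metrics reported at this step include measures of how closely the predicted profile shape matches the observed one. As a rough illustration of one such measure (illustrative only; the exact metrics and normalizations computed by the CLI may differ), the Jensen-Shannon divergence between an observed and a predicted base-resolution profile can be computed with numpy:

```python
import numpy as np

def jsd(obs, pred, eps=1e-10):
    """Jensen-Shannon divergence (base 2) between two profiles,
    each normalized to a probability distribution over positions."""
    p = np.asarray(obs, dtype=float) + eps
    q = np.asarray(pred, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy 5-bp profiles: 0.0 means identical shape, 1.0 means disjoint support.
observed = [0, 3, 10, 4, 1]
predicted = [1, 2, 9, 5, 1]
print(jsd(observed, predicted))
```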
3. Compute importance scores
"Understanding why a model makes a certain prediction can be as crucial as the prediction’s accuracy in many applications". SHAP (SHapley Additive exPlanations) assigns each feature an importance value for a particular prediction. In our case we will assign each nucleotide an importance value or how much it contributed for the binding at each input region according to our model. A model that can predict the binding signal well at the test regions presumably must have learnt usefull features and thus the importance score values can shed light in to the driving sequence feature including the Transcription factor binding motifs.
```
SHAP_DIR=$BASE_DIR/shap
mkdir $SHAP_DIR

bpnet-shap \
    --reference-genome $REFERENCE_GENOME \
    --model $MODEL_DIR/model_split000 \
    --bed-file $DATA_DIR/peaks_inliers.bed \
    --output-dir $SHAP_DIR \
    --input-seq-len 2114 \
    --control-len 1000 \
    --task-id 0 \
    --input-data $INPUT_DATA \
    --chrom-sizes $CHROM_SIZES \
    --generate-shap-bigWigs

# You can also provide, for example, --chroms chr1 to generate shap scores
# only on a subset of chromosomes (chr1)
```
This script will output SHAP bigwig tracks, which are useful for visualizing in a genome browser the bases/motifs that are important for binding, if --generate-shap-bigWigs is specified.
The UCSC and WashU browsers currently support visualizing the importance scores; use the Dynseq option.
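A common post-processing step in this ecosystem (sketched here with toy numbers, not taken from the BPNet code): per-base importance tracks are obtained by multiplying the model's hypothetical contribution scores with the one-hot encoded sequence, keeping only the score of the base actually present at each position:

```python
import numpy as np

# Hypothetical contribution scores for a toy 4-bp sequence "ACGT":
# shape (length, 4) over the alphabet A, C, G, T.
hyp_scores = np.array([[0.5, -0.1,  0.0, 0.2],
                       [0.1,  0.8, -0.2, 0.0],
                       [0.0, -0.3,  0.6, 0.1],
                       [0.2,  0.0,  0.1, 0.9]])

# One-hot encoding of "ACGT": position i holds the i-th letter,
# so the encoding is the 4x4 identity matrix.
one_hot = np.eye(4)

# Zero out contributions of absent bases, then sum over the alphabet axis
# to get the per-base (actual) importance track.
actual = (hyp_scores * one_hot).sum(axis=1)
print(actual)  # the scores of the present bases: 0.5, 0.8, 0.6, 0.9
```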
Please cite:
The dynseq browser track shows context-specific features at nucleotide resolution
S Nair, A Barrett, D Li, BJ Raney, BT Lee, P Kerpedjiev, V Ramalingam, ... Nature genetics 54 (11), 1581-1583
4. Discover motifs with TF-modisco
"TF-MoDISco is a biological motif discovery algorithm that differentiates itself by using attribution scores from a machine learning model, in addition to the sequence itself, to guide motif discovery. Using the attribution scores, as opposed to just the sequence, can be beneficial because the attributions fine-map the specific sequence drivers of biology. Although in many of our examples this model is BPNet and the attributions are from DeepLIFT/DeepSHAP, there is no limit on what attribution algorithm is used, or what model the attributions come from."
This step allows one to get a list of sequence motifs that drive the signal in the studied experiment.
Use the newer version of TF-MoDISco, called modisco-lite, to discover the motifs. Support for using modisco-lite directly from the BPNet repo will be added later. For now, use: https://github.com/jmschrei/tfmodisco-lite
```
# for example
conda create --name tfmodisco python=3.10
conda activate tfmodisco
pip install modisco-lite

MODISCO_DIR=$BASE_DIR/modisco
MODISCO_COUNTS_DIR=$BASE_DIR/modisco/counts
MODISCO_PROFILE_DIR=$BASE_DIR/modisco/profile
mkdir -p $MODISCO_COUNTS_DIR
mkdir -p $MODISCO_PROFILE_DIR

modisco motifs \
    --max_seqlets 50000 \
    --h5py $SHAP_DIR/profile_scores.h5 \
    -o $MODISCO_PROFILE_DIR/profile_modisco_scores.h5 \
    --trim_size 20 \
    --initial_flank_to_add 5 \
    --final_flank_to_add 10

modisco motifs \
    --max_seqlets 50000 \
    --h5py $SHAP_DIR/counts_scores.h5 \
    -o $MODISCO_COUNTS_DIR/counts_modisco_scores.h5 \
    --trim_size 20 \
    --initial_flank_to_add 5 \
    --final_flank_to_add 10
```
5. Discover location of the identified motifs with FiNeMo
While TF-MoDISco produces a list of sequence motifs, FiNeMo maps the locations of these motifs across all the input regions. FiNeMo is a GPU-accelerated hit caller for retrieving TF-MoDISco motif occurrences from machine-learning-model-generated contribution scores.
Support for using FiNeMo directly from the BPNet repo will be added later. For now, use: https://github.com/austintwang/finemo_gpu
```
# for example
git clone https://github.com/austintwang/finemo_gpu.git
cd finemo_gpu

conda env create -f environment.yml -n finemo
conda activate finemo
pip install --editable .

cd ..

HITS_DIR_COUNTS=$BASE_DIR/hits/counts
mkdir -p $HITS_DIR_COUNTS

finemo extract-regions-bpnet-h5 -c $SHAP_DIR/counts_scores.h5 -o ${HITS_DIR_COUNTS}/regions_bw.npz -w 400 -p $BASE_DIR/data/peaks_inliers.bed

finemo call-hits -l ${lambda} -r ${HITS_DIR_COUNTS}/regions_bw.npz -m ${modisco_h5} -p $BASE_DIR/data/peaks_inliers.bed -C $CHROM_SIZES -o ${HITS_DIR_COUNTS} -t 0.7 --compile

finemo report -H ${HITS_DIR_COUNTS}/hits.tsv -r ${HITS_DIR_COUNTS}/regions_bw.npz -m ${modisco_h5} -p $BASE_DIR/data/peaks_inliers.bed -o ${HITS_DIR_COUNTS}/ -t 0.7 -W 400
```
How to Cite
If you're using BPNet in your work, please cite the original BPNet paper:
Avsec, Ž., Weilert, M., Shrikumar, A. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 53, 354–366 (2021).
And reach out to Vivekanandan Ramalingam and Anshul Kundaje to discuss how to cite this updated version. A preprint associated with this new version will be released soon and will be linked here.
Owner
- Name: Kundaje Lab
- Login: kundajelab
- Kind: organization
- Location: Stanford University
- Website: http://anshul.kundaje.net
- Repositories: 117
- Profile: https://github.com/kundajelab
Compbio and machine learning code repositories from the Kundaje Lab at Stanford Genetics and Computer Science Depts.
GitHub Events
Total
- Issues event: 14
- Watch event: 4
- Delete event: 1
- Issue comment event: 7
- Push event: 15
- Pull request review event: 2
- Pull request event: 2
- Gollum event: 4
- Fork event: 4
- Create event: 4
Last Year
- Issues event: 14
- Watch event: 4
- Delete event: 1
- Issue comment event: 7
- Push event: 15
- Pull request review event: 2
- Pull request event: 2
- Gollum event: 4
- Fork event: 4
- Create event: 4
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 10
- Total pull requests: 1
- Average time to close issues: 6 months
- Average time to close pull requests: 3 days
- Total issue authors: 7
- Total pull request authors: 1
- Average comments per issue: 0.9
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 7
- Pull requests: 1
- Average time to close issues: 16 days
- Average time to close pull requests: 3 days
- Issue authors: 6
- Pull request authors: 1
- Average comments per issue: 1.14
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- vhecht (6)
- manoloff (2)
- G-Chapman (1)
- patrickgohl (1)
- pengjk689 (1)
- jamesgalante (1)
- arodel21 (1)
Pull Request Authors
- vhecht (2)
- kmualim (1)