chrombpnet

Bias factorized, base-resolution deep learning models of chromatin accessibility (chromBPNet)

https://github.com/kundajelab/chrombpnet

Last synced: 9 months ago

Repository


Basic Info

Host: GitHub
Owner: kundajelab
License: mit
Language: Jupyter Notebook
Default Branch: master
Homepage: https://github.com/kundajelab/chrombpnet/wiki
Size: 370 MB

Statistics

Stars: 169
Watchers: 46
Forks: 47
Open Issues: 7
Releases: 14

Created almost 5 years ago · Last pushed 11 months ago

Metadata Files

Readme Changelog License Citation

Bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants

This repo contains code for the paper ChromBPNet: Bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants by Anusri Pampari, Anna Shcherbina, Anshul Kundaje. (*authors contributed equally)
Please contact Anusri Pampari for suggestions and comments.
Slides, an ISMB talk and a comprehensive tutorial are available. Please see the FAQ and file a GitHub issue if you have questions.
If you are using chrombpnet <= v0.1.3, please refer to the note here: https://github.com/kundajelab/chrombpnet/wiki/Denovo-motif-discovery
If you are actively using the chrombpnet repo in your project, I strongly recommend adding yourself to the watchers list. Click on the eye symbol (below the star and above the fork symbol, to the right). This will keep you informed of all major updates and bugs posted for this repo.

Chromatin profiles (DNASE-seq and ATAC-seq) exhibit multi-resolution shapes and spans regulated by co-operative binding of transcription factors (TFs). This complexity is made harder to mine by confounding bias from the enzymes (DNASE-I/Tn5) used in these assays. Existing methods neither model this complexity at base resolution nor correct for enzyme bias properly, and so miss the high-resolution architecture of these profiles. Here we introduce ChromBPNet to address both aspects.

ChromBPNet (shown in the image as Bias-Factorized ChromBPNet) is a fully convolutional neural network that uses dilated convolutions with residual connections to enable large receptive fields with efficient parameterization. It also performs automatic assay bias correction in two steps. First, a simple model is trained on chromatin background to capture the enzyme effect (called Frozen Bias Model in the image). Then this model is used to regress out the effect of the enzyme from the ATAC-seq/DNASE-seq profiles. This two-step process ensures that the sequence component of the ChromBPNet model (called TF Model) does not learn enzymatic bias.
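To make the receptive-field point concrete, here is a minimal sketch that computes the receptive field of a stack of stride-1 1D convolutions, where a layer with kernel size k and dilation d widens the field by (k - 1) * d. The layer configuration is purely illustrative and is an assumption, not ChromBPNet's actual hyperparameters.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Receptive field (in bases) of stacked stride-1 1D convolutions.

    Each layer is a (kernel_size, dilation) pair and adds
    (kernel_size - 1) * dilation positions to the field.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Hypothetical stack: kernel size 3, dilation doubling at every layer.
dilated = [(3, 2 ** i) for i in range(8)]
# Same depth and kernel size, but without dilation.
plain = [(3, 1) for _ in range(8)]

print(receptive_field(dilated))  # 511
print(receptive_field(plain))    # 17
```

Both stacks have the same number of layers and the same kernel size, so the wider field of the dilated stack comes at no extra parameter cost.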

ChromBPNet

Installation

This section discusses the packages needed to train a ChromBPNet model. First, it is recommended that you use a GPU for model training and have the necessary NVIDIA drivers and CUDA already installed. You can verify that your machine is set up to use GPUs properly by executing the nvidia-smi command and checking that it returns information about your system GPU(s) rather than an error. Second, there are two ways to get the packages needed to train ChromBPNet models, detailed below:
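The nvidia-smi check can also be scripted before a long training job is queued. The sketch below is one possible way to do it, using only the Python standard library; the function name `gpu_tool_ok` and the 30-second timeout are assumptions made for illustration. It only confirms that the command exists and exits cleanly, not that the deep learning framework can actually use the GPU.

```python
import shutil
import subprocess


def gpu_tool_ok(command: str = "nvidia-smi") -> bool:
    """Return True if `command` is on PATH and exits with status 0."""
    if shutil.which(command) is None:
        return False
    try:
        result = subprocess.run(
            [command],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,  # arbitrary limit chosen for this sketch
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible" if gpu_tool_ok() else "nvidia-smi missing or failed")
```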

1. Running in docker

Download and install the latest version of Docker for your platform (see Docker Installers). Run the docker run command below to open an environment with all the packages installed, then run cd chrombpnet to start the tutorial.

Note: To access your system GPUs from within the docker container, you must have the NVIDIA Container Toolkit installed on your host machine.

docker run -it --rm --memory=100g --gpus device=0 kundajelab/chrombpnet:latest

2. Local installation

Create a clean conda environment with python >=3.8:

conda create -n chrombpnet python=3.8
conda activate chrombpnet

Install non-Python requirements via conda:

conda install -y -c conda-forge -c bioconda samtools bedtools ucsc-bedgraphtobigwig pybigwig meme

Install from pypi

pip install chrombpnet

Install from source

git clone https://github.com/kundajelab/chrombpnet.git
pip install -e chrombpnet

QuickStart

Bias-factorized ChromBPNet training

The command to train ChromBPNet with a pre-trained bias model looks like this. Only one of -ibam, -ifrag or -itag is accepted; the example uses -ibam, and the alternatives are -ifrag /path/to/input.tsv or -itag /path/to/input.tagAlign.

chrombpnet pipeline \
  -ibam /path/to/input.bam \
  -d "ATAC" \
  -g /path/to/hg38.fa \
  -c /path/to/hg38.chrom.sizes \
  -p /path/to/peaks.bed \
  -n /path/to/nonpeaks.bed \
  -fl /path/to/fold_0.json \
  -b /path/to/bias.h5 \
  -o path/to/output/dir/
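When the pipeline is launched from a script, the "only one of ibam, ifrag or itag" rule is easy to enforce before anything runs. The following is a minimal sketch, assuming nothing beyond the flags shown in the command above; the helper name `build_pipeline_command` is hypothetical and is not part of the chrombpnet package.

```python
from typing import Dict, List

# Exactly one of these read-input flags may be given (flag names from the command above).
READ_FLAGS = ("-ibam", "-ifrag", "-itag")


def build_pipeline_command(reads: Dict[str, str], options: Dict[str, str]) -> List[str]:
    """Build an argv list for `chrombpnet pipeline`.

    `reads` must hold exactly one of -ibam / -ifrag / -itag.
    `options` holds the remaining flags, passed through unchanged.
    """
    unknown = set(reads) - set(READ_FLAGS)
    if unknown:
        raise ValueError(f"unexpected read flags: {sorted(unknown)}")
    if len(reads) != 1:
        raise ValueError("pass exactly one of -ibam, -ifrag or -itag")
    argv = ["chrombpnet", "pipeline"]
    for flag, value in {**reads, **options}.items():
        argv += [flag, value]
    return argv


cmd = build_pipeline_command(
    {"-ibam": "/path/to/input.bam"},
    {
        "-d": "ATAC",
        "-g": "/path/to/hg38.fa",
        "-c": "/path/to/hg38.chrom.sizes",
        "-p": "/path/to/peaks.bed",
        "-n": "/path/to/nonpeaks.bed",
        "-fl": "/path/to/fold_0.json",
        "-b": "/path/to/bias.h5",
        "-o": "path/to/output/dir/",
    },
)
print(" ".join(cmd))
# To launch it, one could pass the list to subprocess.run(cmd, check=True).
```

Building an argv list rather than a shell string avoids quoting problems with paths that contain spaces.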

Input Format

-ibam or -ifrag or -itag: input file path with filtered reads in one of bam, fragment or tagalign formats. Example files for supported types - bam, fragment, tagalign
-d: assay type. The following types are supported - "ATAC" or "DNASE"
-g: reference genome fasta file. Example file human reference - hg38.fa
-c: chromosome and size tab separated file. Example file in human reference - hg38.chrom.sizes
-p: Input peaks in narrowPeak file format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Internally, every region is centered at start + summit. Example file with ENCSR868FGK dataset - peaks.bed
-n: Input nonpeaks (background regions) in narrowPeak file format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Internally, every region is centered at start + summit. Example file with ENCSR868FGK dataset - nonpeaks.bed. More instructions on how to make your own nonpeak file can be found in the Preprocessing guide.
-fl: json file showing split of chromosomes for train, test and valid. Example 5 fold jsons for human reference - folds
-b: Bias model in .h5 format. Bias models are generally transferable across assay types that follow a similar protocol. A repository of pre-trained bias models is available here. Instructions to train a custom bias model are given below.
-o: Output directory path
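The 10-column requirement and the start + summit centering described for -p and -n can be checked on your own files with a few lines of Python. This is a sketch under stated assumptions: the `Region` type and `parse_narrowpeak_line` helper are hypothetical names, and the example record is made up.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    summit: int  # offset from start, taken from the 10th column

    @property
    def center(self) -> int:
        """Position the region is centered on: start + summit."""
        return self.start + self.summit


def parse_narrowpeak_line(line: str) -> Region:
    """Parse one tab-separated narrowPeak line with exactly 10 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    return Region(fields[0], int(fields[1]), int(fields[2]), int(fields[9]))


# Made-up record; the unused columns 4-9 hold placeholder values.
example = "chr1\t1000\t1500\t.\t0\t.\t0\t-1\t-1\t250"
region = parse_narrowpeak_line(example)
print(region.chrom, region.center)  # chr1 1250
```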

Please find scripts and best practices for preprocessing here.

Output Format

The output directory will be populated as follows -

```
models\
	bias_model_scaled.h5
	chrombpnet.h5
	chrombpnet_nobias.h5 (TF-Model i.e model to predict bias corrected accessibility profile)
logs\
	chrombpnet.log (loss per epoch)
	chrombpnet.log.batch (loss per batch per epoch)
	(..other hyperparameters used in training)

auxilary\
	filtered.peaks
	filtered.nonpeaks
	...

evaluation\
	overall_report.pdf
	overall_report.html
	bw_shift_qc.png
	bias_metrics.json
	chrombpnet_metrics.json
	chrombpnet_only_peaks.counts_pearsonr.png
	chrombpnet_only_peaks.profile_jsd.png
	chrombpnet_nobias_profile_motifs.pdf
	chrombpnet_nobias_counts_motifs.pdf
	chrombpnet_nobias_max_bias_response.txt
	chrombpnet_nobias.....footprint.png
	...
```

A detailed usage guide, with more information on the input arguments, the output file formats and how to work with them, is provided here and here.
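After a run finishes, a quick check that the key model and log files exist can catch a silently failed job. The sketch below assumes only the file names that appear in the listing above; the helper `missing_outputs` and the choice of which files to check are illustrative.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence

# Subset of the files listed above; names are copied from that listing.
EXPECTED = (
    "models/chrombpnet.h5",
    "models/chrombpnet_nobias.h5",
    "logs/chrombpnet.log",
    "logs/chrombpnet.log.batch",
)


def missing_outputs(output_dir: str, expected: Sequence[str] = EXPECTED) -> List[str]:
    """Return the expected relative paths that are absent under output_dir."""
    root = Path(output_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


# Demonstration on a throwaway directory that holds only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "models").mkdir()
    (Path(tmp) / "models" / "chrombpnet.h5").touch()
    print(missing_outputs(tmp))
```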

For more information, also see:

Bias Model training

The command to train a custom bias model looks like this. Only one of -ibam, -ifrag or -itag is accepted; the example uses -ibam, and the alternatives are -ifrag /path/to/input.tsv or -itag /path/to/input.tagAlign.

chrombpnet bias pipeline \
  -ibam /path/to/input.bam \
  -d "ATAC" \
  -g /path/to/hg38.fa \
  -c /path/to/hg38.chrom.sizes \
  -p /path/to/peaks.bed \
  -n /path/to/nonpeaks.bed \
  -fl /path/to/fold_0.json \
  -b 0.5 \
  -o path/to/output/dir/

Input Format

-ibam or -ifrag or -itag: input file path with filtered reads in one of bam, fragment or tagalign formats. Example files for supported types - bam, fragment, tagalign
-d: assay type. Following types are supported - "ATAC" or "DNASE"
-g: reference genome fasta file. Example file human reference - hg38.fa
-c: chromosome and size tab separated file. Example file in human reference - hg38.chrom.sizes
-p: Input peaks in narrowPeak file format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Internally, every region is centered at start + summit. Example file with ENCSR868FGK dataset - peaks.bed
-n: Input nonpeaks (background regions) in narrowPeak file format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Internally, every region is centered at start + summit. Example file with ENCSR868FGK dataset - nonpeaks.bed
-fl: json file showing split of chromosomes for train, test and valid. Example 5 fold jsons for human reference - folds
-o: Output directory path
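For the fold file, it can help to confirm that no chromosome lands in more than one split before training starts. The sketch below assumes the JSON is an object with the keys "train", "valid" and "test", each holding a list of chromosome names; these key names and the example split are assumptions for illustration, so compare them against the example fold files before relying on them.

```python
import json
from typing import Dict, List


def check_fold(fold: Dict[str, List[str]]) -> None:
    """Raise ValueError unless train/valid/test are present and disjoint."""
    keys = ("train", "valid", "test")  # assumed key names
    missing = [k for k in keys if k not in fold]
    if missing:
        raise ValueError(f"missing splits: {missing}")
    seen = set()
    for key in keys:
        overlap = seen & set(fold[key])
        if overlap:
            raise ValueError(f"chromosomes in more than one split: {sorted(overlap)}")
        seen |= set(fold[key])


# Illustrative split only; not one of the published folds.
fold = {
    "train": ["chr2", "chr3", "chr4"],
    "valid": ["chr8"],
    "test": ["chr1"],
}
check_fold(fold)
print(json.dumps(fold, indent=2))
```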

Please find scripts and best practices for preprocessing here.

Output Format

The output directory will be populated as follows -

```
models\
	bias.h5
logs\
	bias.log (loss per epoch)
	bias.log.batch (loss per batch per epoch)
	(..other hyperparameters used in training)

intermediates\
	...

evaluation\
	overall_report.html
	overall_report.pdf
	pwm_from_input.png
	k562_epoch_loss.png
	bias_metrics.json
	bias_only_peaks.counts_pearsonr.png
	bias_only_peaks.profile_jsd.png
	bias_only_nonpeaks.counts_pearsonr.png
	bias_only_nonpeaks.profile_jsd.png
	bias_predictions.h5
	bias_profile.pdf
	bias_counts.pdf
	...
```

A detailed usage guide, with more information on the input arguments, the output file formats and how to work with them, is provided here and here.

For more information, also see:

How to Cite

If you're using ChromBPNet in your work, please cite as follows:

@article {Pampari2024.12.25.630221,
	author = {Pampari, Anusri and Shcherbina, Anna and Kvon, Evgeny and Kosicki, Michael and Nair, Surag and Kundu, Soumya and Kathiria, Arwa S. and Risca, Viviana I. and Kuningas, Kristiina and Alasoo, Kaur and Greenleaf, William James and Pennacchio, Len A. and Kundaje, Anshul},
	title = {ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants},
	elocation-id = {2024.12.25.630221},
	year = {2024},
	doi = {10.1101/2024.12.25.630221},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/12/25/2024.12.25.630221},
	eprint = {https://www.biorxiv.org/content/early/2024/12/25/2024.12.25.630221.full.pdf},
	journal = {bioRxiv}
}

Owner

Name: Kundaje Lab
Login: kundajelab
Kind: organization
Location: Stanford University

Website: http://anshul.kundaje.net
Repositories: 117
Profile: https://github.com/kundajelab

Compbio and machine learning code repositories from the Kundaje Lab at Stanford Genetics and Computer Science Depts.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Pampari"
  given-names: "Anusri"
  orcid: "https://orcid.org/0000-0002-6579-4070"
- family-names: "Shcherbina"
  given-names: "Anna"
- family-names: "Nair"
  given-names: "Surag"
- family-names: "Schreiber"
  given-names: "Jacob"
- family-names: "Patel"
  given-names: "Aman"
- family-names: "Wang"
  given-names: "Austin"
- family-names: "Kundu"
  given-names: "Soumya"
- family-names: "Shrikumar"
  given-names: "Avanti"
- family-names: "Kundaje"
  given-names: "Anshul"
title: "Bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants."
version: 0.1.1
doi: 10.5281/zenodo.7567627
date-released: 2023-01-24
url: "https://github.com/kundajelab/chrombpnet"

GitHub Events

Total

Create event: 5
Release event: 6
Issues event: 73
Watch event: 52
Delete event: 3
Issue comment event: 113
Push event: 12
Pull request event: 1
Gollum event: 15
Fork event: 14

Last Year

Create event: 5
Release event: 6
Issues event: 73
Watch event: 52
Delete event: 3
Issue comment event: 113
Push event: 12
Pull request event: 1
Gollum event: 15
Fork event: 14

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 38
Total pull requests: 1
Average time to close issues: 2 months
Average time to close pull requests: N/A
Total issue authors: 29
Total pull request authors: 1
Average comments per issue: 2.21
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 34
Pull requests: 1
Average time to close issues: 22 days
Average time to close pull requests: N/A
Issue authors: 25
Pull request authors: 1
Average comments per issue: 2.12
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Brawni (5)
arodel21 (4)
saninta0212 (4)
mfansler (3)
cmlakhan (3)
Manonbaudic (2)
zhenyu7500 (2)
viramalingam (2)
WhatMelonGua (2)
GGboy-Zzz (2)
jtls0n (2)
allthingsgenome (2)
linzyzhao2002 (2)
sid5427 (2)
WanyingX (1)

Pull Request Authors

mfansler (1)
hdbeukel (1)
terencewtli (1)
sadhanagaddam3 (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 358 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 9
Total maintainers: 2

pypi.org: chrombpnet

chrombpnet predicts chromatin accessibility from sequence

Documentation: https://chrombpnet.readthedocs.io/
License: MIT
Latest release: 1.0.1
published 11 months ago

Versions: 9
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 358 Last month

Rankings

Dependent packages count: 10.1%

Downloads: 14.7%

Average: 30.7%

Dependent repos count: 67.1%

Maintainers (2)

annashcherbina anusri

Last synced: 10 months ago

chrombpnet

Science Score: 67.0%
