replitimer
Step-by-step instructions and Snakemake pipeline for processing Replication Timing Data
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 6 DOI reference(s) in README -
✓Academic publication links
Links to: arxiv.org, zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.6%) to scientific vocabulary
Keywords
Repository
Step-by-step instructions and Snakemake pipeline for processing Replication Timing Data
Basic Info
Statistics
- Stars: 1
- Watchers: 0
- Forks: 1
- Open Issues: 6
- Releases: 14
Topics
Metadata Files
README.md
RepliTimer
Project Description:
RepliTimer processes short whole-genome sequencing reads from any organism's G1 and S phase cells into replication timing values. To enable step-by-step data processing, we describe each of the individual data processing steps. We provide a Snakemake pipeline with clearly defined dependencies and Anaconda environments to automate the pipeline. We include a compact dataset in the repository that you may use to test the pipeline. We also provide an example detailing how RepliTimer can be used to process publicly available replication timing data from zebrafish.
Table of contents:
- Requirements
- Description of individual steps in pipeline
- Step-by-step instructions on running Snakemake pipeline:
- 1. Load slurm and miniconda
- 2. Clone repository
- 3. Start the conda environment
- 3A. FIRST TIME ONLY: Setup conda environment
- 3B. Activate conda environment
- 4. Modify the job-specific configuration files.
- 4A. Modify the config/config.yml file
- 4B. Modify the config/samples.csv file
- 4C. IF SLURM RESOURCE CHANGES ARE NEEDED. Modify the config/cluster_config.yml file
- 5. Do a dry run
- 6. Make a DAG diagram
- 7. Run on cluster with slurm
- 7A. Use conda environments
- 7B. Use modules
- Output structure
- Examples
- References
Requirements
General requirements:
Requirement|Description :-----:|:-----: Raw RT data|separate .fastq files from paired-end whole genome sequencing of S phase and matched G1 cells Indexed genome|your genome of interest indexed for bwa RT windows|.bed file with coordinates of replication timing windows. RT windows file for the zebrafish GRCz11 genome is provided. Blacklisted regions|.bed file with blacklisted regions. Blacklisted file provided for zebrafish GRCz11 is provided. Greylisted regions|.bed file with greylisted regions. Greylisted file for zebrafish GRCz11 that works well for the Tab5 strain is provided. Software|fastp/0.23.2, bwa/0.7.15, samtools/1.14, picard/2.21.2, bamtools/2.5.1, bedtools/2.30.0, R/4.1.2-mkl R Packages|GenomicRanges, SummarizedExperiment, pspline, bamsignals, matrixStats
Snakemake pipeline requirements:
Requirement|Description :-----:|:-----: Software|miniconda/4.11.0 (for Conda environments), python/3.10.2, numpy/1.2.2, pandas/1.4.1, snakemake-minimal/7.3.1 Snakemake config file|.csv file with snakemake parameters in the config/ directory Snakemake cluster config file|.csv file with cluster configuration parameters in the config/directory
Description of individual steps in pipeline:
The data are processed according to Siefert, 2017 and Siefert, 2018 with some modifications. Paired-end sequencing reads are trimmed with fastp and then aligned to the genome using BWA-mem. Aligned reads are filtered using picard tools, bamtools, and bedtools. Reads are counted in user selected replication timing windows using the R bamsignals package with a custom R script. The S/G1 quotients are calculated for each window using a custom R script and then smoothed with a with a cubic smoothing spline through a custom R script and the pspline R package. The smoothed quotients are then transformed into log2 values or z-scores. If replicate data is provided, median values are calculated. Individual and median data for all samples are provided in an R RangedSummarizedExperiment (.rds) file and in individual .bedgraph files.

1. trimreadswith_fastp
bash
fastp \
-i {input.fq1} \
-I {input.fq2} \
-o {output.trimmed1} \
-O {output.trimmed2} \
-h {output.fastp_report} \
--json {output.fastp_json} \
-R "{wildcards.sample}" \
-w {params.threads}
2. runbwamem
```bash
align pair of .fastq files to the genome and convert the .sam output to .bam
bwa mem \ -M \ -t {params.threads} \ {params.genome} \ {input.R1} {input.R2} | \ samtools sort \ -@ {params.threads} > {output.bam}
samtools index \ -@ {params.threads} \ {output.bam} > {output.bai} ```
3. markduplicateswith_picard
```bash JAVAMEMOPTS={params.memory}
RAM=$(echo $JAVAMEMOPTS | rev | cut -c2- | rev)
picard MarkDuplicates \ COMPRESSIONLEVEL=9 \ VALIDATIONSTRINGENCY=LENIENT \ MAXRECORDSINRAM=$((200000*RAM)) \ CREATEINDEX=true \ I={input.bam} \ O={output.markedbam} \ M={output.markedmetrics}
samtools index {output.marked_bam} ```
4. qualityfilterwith_bamtools
```bash bamtools filter \ -in {input.markedbam} \ -forceCompression \ -mapQuality '{params.mapQ}' \ -isDuplicate false \ -isFailedQC false \ -isMapped true \ -isMateMapped true \ -isPaired true \ -isPrimaryAlignment true \ -isProperPair true \ -out {output.filteredbam}
samtools index {output.filtered_bam} ```
5. blacklistfilterwith_bedtools
```bash bedtools intersect \ -v \ -split \ -abam {input.filteredbam} \ -b {params.HiCountBEDFILENAME} > {input.filteredbam}_FILTERED1.BAM
bedtools intersect \ -v \ -split \ -f 0.3 \ -abam {input.filteredbam}FILTERED1.BAM \ -b {params.SatBEDFILENAME} > {output.doubleFiltered_bam}
rm {input.filteredbam}FILTERED1.BAM
samtools index {output.doubleFiltered_bam} ```
6. countreadsinRTwindows
bash
Rscript \
workflow/scripts/CountReadsOverBed_ver02.R \
{params.RT_windows} \
{input.doubleFiltered_bam} \
{output.counts}
7. Merge count tables
bash
SAMPLES=( {params.sample_list} )
BEDGRAPHS=( {params.counts_bedgraphs_list} )
TEMP_COUNTS=( {params.counts_temp_list} )
mkdir results/temp_merge
for i in ${{!BEDGRAPHS[@]}}; do
echo -e "\t\t\t${{SAMPLES[i]}}" > ${{TEMP_COUNTS[i]}}.header
cut -f4 ${{BEDGRAPHS[i]}} > ${{TEMP_COUNTS[i]}}.counts
cat ${{TEMP_COUNTS[i]}}.header ${{TEMP_COUNTS[i]}}.counts > ${{TEMP_COUNTS[i]}}
rm ${{TEMP_COUNTS[i]}}.header
rm ${{TEMP_COUNTS[i]}}.counts
done
paste {params.counts_temp_list} > {output.merged}
rm -rf results/temp_merge/
Rscript workflow/scripts/MakeCountsRSE_ver01.R \
{output.merged} \
{params.samples_table} \
{params.RT_windows} \
{output.rse_counts}
8. Process count tables
bash
Rscript workflow/scripts/CalculateQuotientsSmoothScale_ver01.R \
{input.rse_counts} \
{output.rse_processed}
9. Make bedgraphs
bash
mkdir Log2Ratios
mkdir ZScores
mkdir Smoothed
mkdir Quotients
Rscript workflow/scripts/Generate_RT_Bedgraphs_ver01.R \
{input.rse_processed} \
"Log2Ratios"
Rscript workflow/scripts/Generate_RT_Bedgraphs_ver01.R \
{input.rse_processed} \
"ZScores"
Rscript workflow/scripts/Generate_RT_Bedgraphs_ver01.R \
{input.rse_processed} \
"Smoothed"
Rscript workflow/scripts/Generate_RT_Bedgraphs_ver01.R \
{input.rse_processed} \
"Quotients"
tar -czvf {output.Log2Ratios_bedgraphs} Log2Ratios/
tar -czvf {output.ZScores_bedgraphs} ZScores/
tar -czvf {output.Smoothed_bedgraphs} Smoothed/
tar -czvf {output.Quotients_bedgraphs} Quotients/
rm -rf Log2Ratios
rm -rf ZScores
rm -rf Smoothed
rm -rf Quotients
Step-by-step instructions on running Snakemake pipeline:
1. Load slurm and miniconda
Note. The commands to do this will be different on your machine. These commands are specific to an HPC using slurm with these modules installed.
bash
ml slurm/20.02
ml miniconda/4.11.0
2. Clone repository
```bash git clone https://github.com/SansamLab/RepliTimer.git
rename folder with project name
mv RepliTimer/ MyRTProject_Folder/
change directory into root of your project folder
cd MyRTProject_Folder ```
3. Start the conda environment
3A. FIRST TIME ONLY: Setup conda environment with snakemake
```bash
-f is the location of the environment .yml file.
The relative path assumes that you are in the root directory of this repository.
-p is the path where you want to install this environment
conda env create -f workflow/envs/SnakemakeEnv2.yml -p /s/sansam-lab/SnakemakeEnv2 ```
3B. Activate conda environment with snakemake
bash
conda activate /s/sansam-lab/SnakemakeEnv2
4. Modify the job-specific configuration files.
4A. Modify the config/config.yml file
You must enter the following: * samplestable: * path and name of the .csv file that lists sample names and paths to all fastq files * example: "config/testSamples.csv" * fastpcpus: * number of processors that fastp will use to trim your sequencing reads * example: 8 * bwacpus: * number of processors that bwa mem will use to align reads to genome * example: 12 * picardmemory: * amount of memory allocated to picard for marking duplicates * example: "24G" * bwagenome: * location of bwa indexed genome for the alignment * example: "/Volumes/htscore/Shared/zebrafish/danRer11/noAlts/danRer11noAlts.fa" * bamtoolsfiltermapQ: * mapQ score cutoff for quality filtering reads * example: ">10" * bedtoolsintersectblacklist: * .bed file with blacklisted genomic regions for filtering out reads * example: "resources/Satellites.bed" * for zebrafish GRCz11 * file has regions with certain repeat types that we find are associated with anomalously high read counts * bedtoolsintersectgreylist: * .bed file with greylisted genomic regions for filtering out reads * example: "resources/HiCountWindows.bed" * for zebrafish GRCz11 * file has regions that have anomalously high read counts in multiple Tab5 whole genome sequencing runs * RTwindows: * .bed file with regions used for calculating replication timing values * example: "resources/RTWindowsdanRer11noAlts_ver01.bed" * for zebrafish GRCz11
4B. Modify the config/samples.csv file
The samples.csv file in the config folder has paths to the test fastq files. You must replace those paths with those for your own fastq files. The first column of each row is the sample name. This name will be used for all output files.
4C. IF SLURM RESOURCE CHANGES ARE NEEDED. Modify the config/cluster_config.yml file
CPU and memory requests for each rule in the pipeline are detailed in this file. If you are using SLURM, you may need to alter this file to fit your needs/system.
5. Do a dry run.
A dry run produces a text output showing exactly what commands will be executed. Look this over carefully before submitting the full job. It is normal to see warnings about changes made to the code, input, and params.
bash
snakemake -npr
6. Make a DAG diagram.
bash
snakemake --dag | dot -Tpdf > dag.pdf
7. Run on cluster with slurm.
This snakemake pipeline could be executed without slurm, but if an hpc with slurm is used, the following will start the pipeline with the parameters defined in the config/cluster_config.yml file.
7A. Use conda environments
If conda is to be used for rule-specific environments, you may find it useful to create the environments first. Running 'snakemake' with the '--conda-create-envs-only' option will create the environments without running the pipeline. The '--conda-prefix' option is used to set a directory in which the conda and conda-archive directories are created. This directory may be changed to a stable or shared location.
bash
sbatch --mem 32G \
--wrap="\
snakemake \
--cores all \
--use-conda \
--conda-prefix ../condEnvs/ \
--conda-create-envs-only \
--conda-frontend conda"
Once the environments are setup, you may execute pipeline with conda environments using the following command:
bash
sbatch --constraint=westmere \
--wrap="\
snakemake \
-R \
-j 999 \
--use-conda \
--conda-prefix ../condEnvs/ \
--conda-frontend conda \
--latency-wait 100 \
--cluster-config config/cluster_config.yml \
--cluster '\
sbatch \
-A {cluster.account} \
-p {cluster.partition} \
--cpus-per-task {cluster.cpus-per-task} \
--mem {cluster.mem} \
--output {cluster.output}'"
7B. Use environment modules.
Rather than using conda environments, you may prefer to use modules installed on your computing cluster. These modules are defined for each rule in 'workflow/Snakefile'. This must be customized for your environment, and you must modify the Snakefile yourself.
To execute the pipeline with environment modules, enter the following:
bash
sbatch --constraint=westmere \
--wrap="\
snakemake \
-R \
-j 999 \
--use-envmodules \
--latency-wait 100 \
--cluster-config config/cluster_config.yml \
--cluster '\
sbatch \
-A {cluster.account} \
-p {cluster.partition} \
--cpus-per-task {cluster.cpus-per-task} \
--mem {cluster.mem} \
--output {cluster.output}'"
8. Check results, and when finished, exit environment.
The results will be saved to the "results" folder. Look over log files generated in either the logs/ or logs/snakelogs folders (depending on whether slurm was used).
bash
conda deactivate
Output structure:
Folder|Description :-----:|:-----: logs|Log file for each snakemake rule executed qc|Quality control files generated by fastp trimmed|.fastq files trimmed by fastq aligned|.bam files aligned by bwamem duplicatesMarkedBam|.bam files with duplicates marked by picardtools filteredBams|.bam files filtered for read quality doubleFilteredBams|.bam files also filtered using blacklisted and greylisted regions counts|.bedgraph files with counts for each sample merged|All counts in a .txt tab delimited table rse|.rds file with RangedSummarized Experiment Object with Counts processed_rse|.rds file with RangedSummarizedExperiment Object with quotients, smoothed, Log2, and ZScores for all S phase samples. Medians values for replicates are also included bedgraphs|Zipped bedgraphs with calculated values from processed_rse
Examples
Run Snakemake pipeline with sample data included in repository Run Snakemake pipeline with Siefert, 2017 data.
References:
Replication timing data processing
Siefert JC, Georgescu C, Wren JD, Koren A, Sansam CL. DNA replication timing during development anticipates transcriptional programs and parallels enhancer activation. Genome Res. 2017;27(8):1406-1416.
Siefert JC, Clowdus EA, Goins D, Koren A, Sansam CL. Profiling dna replication timing using zebrafish as an in vivo model system. JoVE. 2018;(134):57146.
fastp
Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884-i890. doi:10.1093/bioinformatics/bty560
Samtools
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of samtools and bcftools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008
BWA
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv:1303.3997 [q-Bio]. http://arxiv.org/abs/1303.3997
Picard tools
http://broadinstitute.github.io/picard/
bamtools
Barnett DW, Garrison EK, Quinlan AR, Stromberg MP, Marth GT. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011;27(12):1691-1692.
bedtools
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841-842.
R
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
bamsignals
Alessandro Mammana and Johannes Helmuth (2021). bamsignals: Extract read count signals from bam files. R package version 1.26.0. https://github.com/lamortenera/bamsignals
pspline
S original by Jim Ramsey. R port by Brian Ripley ripley@stats.ox.ac.uk. (2022). pspline: Penalized Smoothing Splines. R package version 1.0-19. https://CRAN.R-project.org/package=pspline
SummarizedExperiment
Martin Morgan, Valerie Obenchain, Jim Hester and Herv Pags (2021). SummarizedExperiment: SummarizedExperiment container. R package version 1.24.0. https://bioconductor.org/packages/SummarizedExperiment
GenomicRanges
Lawrence M, Huber W, Pag`es H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118
matrixstats
Henrik Bengtsson (2021). matrixStats: Functions that Apply to Rows and Columns of Matrices (and to Vectors). R package version 0.61.0. https://CRAN.R-project.org/package=matrixStats
Snakemake
Kster, J., & Rahmann, S. (2012). SnakemakeA scalable bioinformatics workflow engine. Bioinformatics (Oxford, England), 28(19), 25202522. https://doi.org/10.1093/bioinformatics/bts480
Python
Anaconda Software Distribution. (2020). Anaconda Documentation. Anaconda Inc. Retrieved from https://docs.anaconda.com/
Owner
- Name: Sansam Lab Genomics Pipelines
- Login: SansamLab-Pipelines-Genomics
- Kind: organization
- Twitter: SansamLab
- Repositories: 5
- Profile: https://github.com/SansamLab-Pipelines-Genomics