https://github.com/backmanlab/cut-tag

A repository to hold code for Backman Lab Cut&Tag data analysis

Last synced: 9 months ago · JSON representation

Repository

A repository to hold code for Backman Lab Cut&Tag data analysis

Basic Info

Host: GitHub
Owner: BackmanLab
Language: Shell
Default Branch: main
Size: 43 KB

Statistics

Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme

Cut&Tag Data Processing Guide

Intro:

Cleavage Under Targets & Tagmentation (CUT&Tag)is an ChIP-like profiling strategy in which primary antibodies are bound to chromatin proteins in situ after nuclei have been permeabilized. Secondary antibodies covalently bound to the cut-and-paste transposase Tn5 are then bound to the primary antibody. Activation of the transposase simultaneously digests DNA and adds high throughput sequencing adapters (referred to as ‘tagmentation’) for paired-end DNA sequencing. Cut&Tag libraries have a much higher signal-to-noise ratio than traditional ChIP libraries and need far fewer cells as well.

Below is documentation for set-up and a single script for parsing Cut&Tag data based off of the data analysis protocol presented here from the Cut&Tag protocol and paper. Additionally, there is an nf-core pipeline for Cut&Tag. Because I want more control over the processing of the data, I will write my own wrapper based on the Henikoff Lab's data analysis protocol.

I submitted three groups of Cut&Tag biological replicates on 6.03.2024, 07.01.2024, and 7.19.2024. The later two submissions have technical replicates in them that should be combined. The sample identifiers for each are below:

6.03.2024 - Submission 1:

|Cells | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| | WT | Pol1AB | i71-i51 | BM101 | | Pol1 degraded | Pol1AB | i71-i52 | BM102 | | Pol1 degraded | K4Me3AB| i71-i53 | BM104 | | WT |K4Me3AB | i71-i54 |BM103 | | WT |Pol2AB |i72-i51 |BM105 | | Pol 1 degraded|Pol2AB |i72-i52 |BM106 | | WT |K9Me3AB |i72-i53 |BM107 | | Pol 1 degraded|K9Me3AB |i72-i54 |BM108 |

07.01.2024 - Submission 2:

|Cells Rep1 | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| |WT |Pol1AB |i71-i51 |BM109 | |Pol 1 degraded |Pol1AB |i71-i52 |BM110 | |Pol 2 degraded |Pol1AB |i71-i53 |BM111 | |WT |Pol2AB |i71-i54 |BM112 | |Pol 1 degraded |Pol2AB |i72-i51 |BM113 | |Pol 2 degraded |Pol2AB |i72-i52 |BM114 | |Pol 1 degraded |K27AcAB |i72-i53 |BM115 | |Pol 2 degraded |K27AcAB |i72-i54 |BM116 |

|Cells Rep2 | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| |WT | Pol1AB | i73-i51 | BM117 | |Pol 1 degraded | Pol1AB | i73-i52 | BM118 | |Pol 2 degraded | Pol1AB | i73-i53 | BM119 | |WT | Pol2AB | i73-i54 | BM120 | |Pol 1 degraded | Pol2AB | i74-i51 | BM121 | |Pol 2 degraded | Pol2AB | i74-i52| BM122 | |Pol 1 degraded | K27AcAB | i74-i53 | BM123 | |Pol 2 degraded | K27AcAB | i74-i54 | BM124 |

07.19.2024 - Submission 3:

|Cells Rep1 | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| |WT |Pol1AB |i71-i51 | BM125 | |Pol 1 degraded | Pol1AB | i71-i52 | BM126 | |Pol 2 degraded | Pol1AB | i71-i53 | BM127 | |WT | CTCFAB | i71-i54 | BM128 | |WT | CTCFAB | i72-i51 | BM129 | |WT | CTCFAB | i72-i52 | BM130 | |Pol 1 degraded | CTCFAB | i72-i53 | BM131 | |Pol 2 degraded | CTCFAB | i72-i54 | BM132 |

|Cells Rep2 | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| | WT | K9AB | i73-i51 | BM133 | | Pol 1 degraded | K9AB | i73-i52 | BM134 | | Pol 2 degraded | K9AB | i73-i53 | BM135 | | WT | K9AB | i73-i54 | BM136 | | Pol 1 degraded | K9AB | i74-i51 | BM137 | | Pol 2 degraded | K9AB | i74-i52 | BM138 | | Pol 1 degraded | CTCFAB | i74-i53 | BM139 | | Pol 2 degraded | CTCFAB | i74-i54 | BM140 |

Cut&Tag uses Bowtie2 for alignment.

Build Bowtie indices for reference genome prior to alignment with Tophat using the following command bowtie2-build pathtoreferencegenome.fa prefixtonameindexes using default bowtie2 on quest, bowtie2/2.4.5

For mapping to rDNA repeats + human genome, use the following reference. For general mapping, use Ensembl

shell VERSION=108 wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz

OR UCSC: UCSC Genome and UCSC GTF (preferred). There are 4 different GTF references to use.

```shell wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz

```

example SLURM header for Quest HPC

```shell

!/bin/bash

SBATCH -A p32171 ##--## SLURM start

SBATCH -p genhimem

SBATCH -N 1

SBATCH --ntasks-per-node=12

SBATCH --mem-per-cpu=10G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu

SBATCH --mail-type=BEGIN

SBATCH --mail-type=END

SBATCH -t 24:00:00

SBATCH --job-name=build_genome

SBATCH --output=outlog

SBATCH --error=errlog

```

```shell

-- header here --

cd /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref

module load bowtie2/2.4.5

bowtie2-build /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref/hg38.fa hg38 ```

GTF file can be found at /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf and genome index can be found at /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref

Spike in normalization

I am using ChIPseqSpikeInFree to normalize the data in lieu of a spike-in. I installed the command line version at: /home/lmc0633/executables/ChIPseqSpikeInFree using the command version of the installation detailed at the github and R/4.3.0

Need a metadata file with the following columns:

ANTIBODY GROUPANTIBODY GROUP |ID | ANTIBODY | GROUP | |---------------|--------|-----------| |WT.bam |Pol1AB | control | |Pol1.bam |Pol1AB | pol1ko | |Pol2.bam |Pol1AB | pol2ko |

Code is written in such a way that I can run all BAMs for a particular group at once. I deposited metadata at /projects/b1042/BackmanLab/Lucas/090124_CnT/metadata as individual .txt files for each submission. each submission is labelled S1, S2, or S3, with corresponding fastqs and alignment files in each respective directory. This can be run on SLURM or it can be run on an interaction session. Script should look something like below:

Shell submission script (if submitting via SLURM):

```shell

module load R/4.3.0

Rscript --vanilla --verbose spike.R

```

R script for Spike In normalization factor. Must have more than 1 core enabled for this to work:

```R

HPC Script for normalization factor generation

library("ChIPseqSpikeInFree")

Must have more than 1 core enabled for this to work

library(BiocParallel) numCores <- parallel::detectCores() register(MulticoreParam(workers = numCores-1), default = TRUE)

n=2 ## 1, 2, or 3

Get metadata path (Must be tab delimited)

dir.meta <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/metadata/" dir.s <- c("06032024.txt", "07012024.txt", "07192024.txt") meta <- paste0(dir.meta,dir.s[n])

Get bam paths

dir.bam <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/" subdirs <- c("S1", "S2", "S3") bam.path <- paste0(dir.bam, subdirs[n],"/BAM/")

Get bam list

tbl <- read.table(meta, sep= "\t", header= T) bams <- tbl$ID

Write out txt files

out <- c("06032024.txt", "07012024.txt", "07192024.txt")

write.table(tbl, file = paste0(dir.meta, out[n]), quote=F, sep = "\t",row.names = F, col.names = TRUE)

library(stringr) input <- str_trim(paste0(bam.path,bams), side = c("both", "left", "right"))

Call function

ChIPseqSpikeInFree(bamFiles = input, chromFile = "hg38", metaFile = meta, prefix = paste0ls("/projects/b1042/BackmanLab/Lucas/090124_CnT/spikein/",subdirs[n]), ncores=4)

```

We need the size of each chromosome as a text file. this can be retrieved at UCSC: shell wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.chrom.sizes

Once we have the scaling factors from , we want to integrate them into generation of our coverage files like so:

shell libSize=`cat sample1.bed|wc -l` scale=15000000/($libSize*$SF) genomeCoverageBed -bg -scale $scale -i sample1.bed -g hg38.chromSizes > sample1.bedGraph bedGraphToBigWig sample1.bedGraph hg38.chromSizes sample1.bw OR

``shell scale_factor=echo "10000 / $seqDepth" | bc -l` echo "Scaling factor for $histName is: $scalefactor!" bedtools genomecov -bg -scale $scalefactor -i $projPath/alignment/bed/${histName}bowtie2.fragments.bed -g $chromSize > $projPath/alignment/bedgraph/${histName}bowtie2.fragments.normalized.bedgraph

```

Peak calling

There are a few options for calling peaks on Cut&Tag data. The more leniant and traditional option is MACS2. Using MACS2 will increase the false positive rate but will capture more peaks overall. SEACR was designed with Cut&Run/TAG data (high SN) in mind. More recent and less used, GoPeaks is also specifically designed for Cut&Run/TAG data. There are several other peak callers, but for my purposes, I will use MACS2 and SEACR. Here is a short explanation on the difference between SEACR and MACS2. This HBC workshop covers calling peaks on Cut&Run/TAG data.

For SEACR, we need to retrieve two scripts and put them in a directory for use later:

```shell mkdir SEACR ## in bin or scripts wget https://github.com/FredHutch/SEACR/raw/master/SEACR1.3.sh wget https://github.com/FredHutch/SEACR/raw/master/SEACR1.3.R

SEACR usage and example:shell

usage

bash SEACR_1.3.sh experimental bedgraph [control bedgraph | numeric threshold] ["norm" | "non"] ["relaxed" | "stringent"] output prefix

example

bash SEACR_1.3.sh target.bedgraph 0.01 non stringent output ```

For the script, we call the following: ```shell seacr=/home/lmc0633/executables/SEACR/SEACR1.3.sh bash $seacr $projPath/alignment/bedgraph/${histName}bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}seacrtop0.01.peaks

```

```shell macs2 callpeak -t $projPath/alignment/bam/${org}/${time}/${hist}/bowtie2.mapped.bam \ -c $projPath/alignment/bam/${org}/${timeControl}/${hist}/bowtie2.mapped.bam \ -g ${genome} -f BAMPE -n macs2peakq0.1 --outdir $projPath/peakCalling/MACS2/${org}/${timeControl}control/${time}/${hist}/ -q 0.1 --keep-dup all 2>${projPath}/peakCalling/MACS2/${org}/${timeControl}control/${time}/${hist}/macs2Peak_summary.txt

```

```shell mkdir -p $projPath/peakCalling macs2 callpeak -t ${projPath}/alignment/bam/${histName}rep1bowtie2.mapped.bam \ -c ${projPath}/alignment/bam/${controlName}rep1bowtie2.mapped.bam \ -g hs -f BAMPE -n macs2peakq0.1 --outdir $projPath/peakCalling/MACS2 -q 0.1 --keep-dup all 2>${projPath}/peakCalling/MACS2/macs2Peak_summary.txt

``` double array in linux

```shell

!/bin/bash

array=( "Vietnam" "Germany" "Argentina" ) array2=( "Asia" "Europe" "America" )

for i in "${!array[@]}"; do printf "%s is in %s\n" "${array[i]}" "${array2[i]}" done

```

Script to call the EU-seq alignment script (Tophat) is below

```shell

!/bin/bash

SBATCH -A p32171 ##--## SLURM alt #SBATCH -A b1042

SBATCH -p genhimem

!/bin/bash

SBATCH -A b1042

SBATCH -p genomics

SBATCH -N 1

SBATCH --ntasks-per-node=16

SBATCH --mem-per-cpu=13G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu

SBATCH --mail-type=BEGIN

SBATCH --mail-type=END

SBATCH -t 48:00:00

SBATCH --job-name=07.19.2024

SBATCH --output=outlog

SBATCH --error=errlog

-- header here --

echo -e "Usage: sh $0 -f -w -g -r -c -t -m -s -p < path to SEACR (optional)> -b < sets MAC2 to --broad (optional)> -d \n"

with MAC2

echo "Processing data" for file in *R1001.fastq.gz; do echo "${file}"; bash CutTag.sh -f ${file} -w /projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S3/Backman267.19.2024/ -g /projects/b1042/BackmanLab/Lucas/090124CnT/genome/ref/hg38 -r /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf/hg38.ncbiRefSeq.gtf -c hg38.chrom.sizes -t 16 -m 2 -d ; done

with SEACR

echo "Processing data" for file in *R1001.fastq.gz; do echo "${file}"; bash CutTag.sh -f ${file} -w /projects/b1042/BackmanLab/Lucas/090124CnT/fastq/test/ -g /projects/b1042/BackmanLab/Lucas/090124CnT/genome/ref/hg38 -r /projects/b1042/BackmanLab/Lucas/090124CnT/genome/gtf/hg38.ncbiRefSeq.gtf -p /home/lmc0633/executables/SEACR/SEACR_1.3.sh -c hg38.chrom.sizes -t 16 -m 2; done

```

For the correlation plot generation script, i adapted this code for use as follows:

```R

== R command ==

library(corrplot)

list.files('/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S1/BED/', pattern = ".bin.bed")

n=1 # c(1,2,3) path <- '/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/' subdir <- c("S1", "S2", "S3") dir <- paste0(path, subdir[n], "/BED/") hists <- list.files(dir, pattern = ".bin.bed")

reprod = c() fragCount = NULL for(hist in hists){

if(is.null(fragCount)){

fragCount = read.table(paste0(dir, hist), header = FALSE) 
colnames(fragCount) = c("chrom", "bin", hist)

}else{

fragCountTmp = read.table(paste0(dir, hist), header = FALSE)
colnames(fragCountTmp) = c("chrom", "bin", hist)
fragCount = full_join(fragCount, fragCountTmp, by = c("chrom", "bin"))

} }

M = cor(fragCount %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs")

corrplot(M, method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 3, rect.col = "black", rect.lwd = 3,cl.pos = "b", tl.col = "indianred4", tl.cex = 0.5, cl.cex = 0.5, addCoef.col = "black", number.digits = 2, number.cex = 0.5, col = colorRampPalette(c("midnightblue","white","darkred"))(100))

``` Script for calling just SEACR peaks. Can't use NORM setting unless I have an input control

```shell

!/bin/bash

SBATCH -A p32171 ##--## SLURM start

SBATCH -p genhimem

SBATCH -N 1

SBATCH --ntasks-per-node=1

SBATCH --mem=200G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu

SBATCH --mail-type=BEGIN

SBATCH --mail-type=END

SBATCH -t 24:00:00

SBATCH --job-name=build_genome

SBATCH --output=outlog

SBATCH --error=errlog

-- header here --

seacr=/home/lmc0633/executables/SEACR/SEACR1.3.sh bgfold=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S3/BEDGRAPH pkfold=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/peaks

module load R/4.3.0 module load bedtools

bash /home/lmc0633/executables/SEACR/SEACR1.3.sh /projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S3/BEDGRAPH/BM125WTS3.bedgraph 0.01 non stringent /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/peaks/test.peaks cd ${bgfold}

echo "Processing data" for file in *.bedgraph; do echo "${file}";

P1=$(echo $file | sed 's/.bedgraph/.seacr.peaks/g')

echo "8. Calling SEACR Peaks on ${file}" echo -e "\t bash $seacr ${bgfold}/${file} 0.01 norm relaxed ${pkfold}/${P1}"

bash $seacr ${bgfold}/${file} 0.01 non stringent ${pkfold}/${P1}

done

echo -e "Peak calling complete \n"

```

Later modification

I looked into merging replicates using Coverage() in R and unfortunately, you lose all the metadate (qValue, pValue, signal) when doing so. I decided to merge my technical replicates at the FASTQ level and then call the peaks again on those. Following calling peaks, I plan to use Bedtools intersect to find the conensus peaks since it preserves the metadata and other information. Merging all technical replicates in submission 2 and in submission 3

```shell

out=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/merged/fastq/ in=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S3/Backman26_7.19.2024/

read1=BM135Pol2degradedS13R2001.fastq.gz ; read2=BM138Pol2degradedS16R2001.fastq.gz ; file=Merged.K9AB.P2KO.S2R2001.fastq.gz

Merge input BAMS

cat ${in}${read1} ${in}${read2} > ${out}${file}

```

```shell

!/bin/bash

SBATCH -A b1042 ##--## SLURM start

SBATCH -p genomics

SBATCH -N 1

SBATCH --ntasks-per-node=10

SBATCH --mem-per-cpu=10G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu

SBATCH --mail-type=BEGIN

SBATCH --mail-type=END

SBATCH -t 6:00:00

SBATCH --job-name=compute matrix

SBATCH --output=outlog

SBATCH --error=errlog

module load deeptools

== linux command ==

inpath=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S2/peaks/ outpath=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S2/peaks/test/ bgpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/coverage/

bgfile=BM109S1.rpgc.bigWig pkfile=BM109S1.seacr.peaks.relaxed.bed otfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/.seacr.peaks.summitRegion.bed/g') matfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/_SEACR.mat.gz/g')

label=P1AB_WT rep=S2 cores=10

awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $inpath$pkfile > $outpath$otfile

computeMatrix reference-point -S $bgpath$bgfile \ -R $outpath$otfile \ --skipZeros -o $outpath$matfile -p $cores -a 3000 -b 3000 --referencePoint center

plotHeatmap -m $outpath$matfile -out ${outpath}${matfile}SEACRheatmap.png --sortUsing sum --startLabel "Peak Start" -\ -endLabel "Peak End" --xAxisLabel "" --regionsLabel "Peaks" --samplesLabel "${label} ${rep}"

inpath=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S2/peaks/ outpath=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S2/peaks/test/ bgpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/coverage/

bgfile=BM110S2.rpgc.bigWig pkfile=BM110S2.seacr.peaks.relaxed.bed otfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/.seacr.peaks.summitRegion.bed/g') matfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/_SEACR.mat.gz/g')

label=P1AB_P1KO rep=S2

awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $inpath$pkfile > $outpath$otfile

computeMatrix reference-point -S $bgpath$bgfile \ -R $outpath$otfile \ --skipZeros -o $outpath$matfile -p $cores -a 3000 -b 3000 --referencePoint center

plotHeatmap -m $outpath$matfile -out ${outpath}${matfile}SEACRheatmap.png --sortUsing sum --startLabel "Peak Start" -\ -endLabel "Peak End" --xAxisLabel "" --regionsLabel "Peaks" --samplesLabel "${label} ${rep}"

```

downsampling reads

I'm downsampling my reads because some are over-sequenced and some are undersequenced. I worry this will contribute to variability between conditions, since some samples did experience overamplification. For this task, I am using seqtk

usage: seqtk sample -s <#ofsamples> >

First thing to do is count the total number of reads per file ```shell

!/bin/bash

SBATCH -A b1042 ##--## SLURM start

SBATCH -p genomics

SBATCH -N 1

SBATCH --ntasks-per-node=1

SBATCH --mem=50G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu

SBATCH -t 2:00:00

SBATCH --job-name=count_reads

for file in *R1_001.fastq.gz; do echo "${file}"; READS=$(expr $(zcat ${file} | wc -l) / 4) echo -e "\t file: $file has $READS reads " echo -e "\t file: $file reads: $READS " >> reads.txt done ```

S1:

|File | Reads | Name | |------------------------------------|-------------------|-----------| | file: K4AB-Pol1S4R1001.fastq.gz | reads: 50,054,872 | K4ABP1KO | | file: K4AB-WTS3R1001.fastq.gz | reads: 68,741,393 | K4ABWT | | file: K9AB-Pol1S8R1001.fastq.gz | reads: 5,858,924 | K9ABP1KO | | file: K9AB-WTS7R1001.fastq.gz | reads: 2,249,616 | K9ABWT | | file: P1AB-Pol1S2R1001.fastq.gz | reads: 21,542,758 | P1ABP1KO | | file: P1AB-WTS1R1001.fastq.gz | reads: 4,246,499 | P1ABWT | | file: P2AB-Pol1S6R1001.fastq.gz | reads: 56,878,563 | P2ABP1KO | | file: P2AB-WTS5R1001.fastq.gz | reads: 21,233,193 | P2ABWT |

S2:

|File | Reads | Name | |----------------------------------|-----------------|-----------| | file: BM109S1R1001.fastq.gz |reads: 2,831,267 | P1ABWT | | file: BM110S2R1001.fastq.gz |reads: 1,997,534 | P1ABP1KO | | file: BM111S3R1001.fastq.gz |reads: 2,761,436 | P1ABP2KO | | file: BM112S4R1001.fastq.gz |reads: 2,904,866 | P2ABWT | | file: BM113S5R1001.fastq.gz |reads: 8,911,631 | P2ABP1KO | | file: BM114S6R1001.fastq.gz |reads: 3,923,244 | P2ABP2KO | | file: BM115S7R1001.fastq.gz |reads: 11,335,882| K27ABP1KO| | file: BM116S8R1001.fastq.gz |reads: 4,809,199 | K27ABP2KO| | file: BM117S9R1001.fastq.gz |reads: 408,680 | P1ABWT | | file: BM118S10R1001.fastq.gz |reads: 3,241,433 | P1ABP1KO | | file: BM119S11R1001.fastq.gz |reads: 3,001,093 | P1ABP2KO | | file: BM120S12R1001.fastq.gz |reads: 1,265,939 | P2ABWT | | file: BM121S13R1001.fastq.gz |reads: 4,573,860 | P2ABP1KO | | file: BM122S14R1001.fastq.gz |reads: 6,784,824 | P2ABP2KO | | file: BM123S15R1001.fastq.gz |reads: 2,668,701 | K27ABP1KO| | file: BM124S16R1001.fastq.gz |reads: 2,413,773 | K27ABP2KO|

Changing names to:

BM109S1R1001.fastq.gz P1ABWTS1R1001.fastq.gz BM110S2R1001.fastq.gz P1ABP1KOS2R1001.fastq.gz BM111S3R1001.fastq.gz P1ABP2KOS3R1001.fastq.gz BM112S4R1001.fastq.gz P2ABWTS4R1001.fastq.gz BM113S5R1001.fastq.gz P2ABP1KOS5R1001.fastq.gz BM114S6R1001.fastq.gz P2ABP2KOS6R1001.fastq.gz BM115S7R1001.fastq.gz K27ABP1KOS7R1001.fastq.gz BM116S8R1001.fastq.gz K27ABP2KOS8R1001.fastq.gz BM117S9R1001.fastq.gz P1ABWTS9R1001.fastq.gz BM118S10R1001.fastq.gz P1ABP1KOS10R1001.fastq.gz BM119S11R1001.fastq.gz P1ABP2KOS11R1001.fastq.gz BM120S12R1001.fastq.gz P2ABWTS12R1001.fastq.gz BM121S13R1001.fastq.gz P2ABP1KOS13R1001.fastq.gz BM122S14R1001.fastq.gz P2ABP2KOS14R1001.fastq.gz BM123S15R1001.fastq.gz K27ABP1KOS15R1001.fastq.gz BM124S16R1001.fastq.gz K27ABP2KOS16R1001.fastq.gz

S3:

|File | Reads | Name | |----------------------------------------------------|------------------|-----------| file: BM125WTS3R1001.fastq.gz | reads: 45,823,405 | P1ABWT file: BM126Pol1degradedS4R1001.fastq.gz | reads: 34,609,056 | P1ABP1KO file: BM127Pol2degradedS5R1001.fastq.gz | reads: 28,356,451 | P1ABP2KO file: BM128WTS6R1001.fastq.gz | reads: 34,639,433 | CTCFABWT file: BM129WTS7R1001.fastq.gz | reads: 49,503,419 | CTCFABWT file: BM130WTS8R1001.fastq.gz | reads: 47,559,479 | CTCFABWT file: BM131Pol1degradedS9R1001.fastq.gz | reads: 38,030,748 | CTCFABP1KO file: BM132Pol2degradedS10R1001.fastq.gz | reads: 28,916,144 | CTCFABP2KO file: BM133WTS11R1001.fastq.gz | reads: 63,835,066 | K9ABWT file: BM134Pol1degradedS12R1001.fastq.gz | reads: 54,217,202 | K9ABP1KO file: BM135Pol2degradedS13R1001.fastq.gz | reads: 53,024,658 | K9ABP2KO file: BM136WTS14R1001.fastq.gz | reads: 49,808,186 | K9ABWT file: BM137Pol1degradedS15R1001.fastq.gz | reads: 53,189,723 | K9ABP1KO file: BM138Pol2degradedS16R1001.fastq.gz | reads: 72,659,514 | K9ABP2KO file: BM139Pol1degradedS17R1001.fastq.gz | reads: 21,882,985 | CTCFABP1KO file: BM140Pol2degradedS18R1001.fastq.gz | reads: 23,661,905 | CTCFABP2KO

Changing names to:

mv BM125WTS3R2001.fastq.gz P1ABWTS3R2001.fastq.gz mv BM126Pol1degradedS4R2001.fastq.gz P1ABP1KOS4R2001.fastq.gz mv BM127Pol2degradedS5R2001.fastq.gz P1ABP2KOS5R2001.fastq.gz mv BM128WTS6R2001.fastq.gz CTCFABWTS6R2001.fastq.gz mv BM129WTS7R2001.fastq.gz CTCFABWTS7R2001.fastq.gz mv BM130WTS8R2001.fastq.gz CTCFABWTS8R2001.fastq.gz mv BM131Pol1degradedS9R2001.fastq.gz CTCFABP1KOS9R2001.fastq.gz mv BM132Pol2degradedS10R2001.fastq.gz CTCFABP2KOS10R2001.fastq.gz mv BM133WTS11R2001.fastq.gz K9ABWTS11R2001.fastq.gz mv BM134Pol1degradedS12R2001.fastq.gz K9ABP1KOS12R2001.fastq.gz mv BM135Pol2degradedS13R2001.fastq.gz K9ABP2KOS13R2001.fastq.gz mv BM136WTS14R2001.fastq.gz K9ABWTS14R2001.fastq.gz mv BM137Pol1degradedS15R2001.fastq.gz K9ABP1KOS15R2001.fastq.gz mv BM138Pol2degradedS16R2001.fastq.gz K9ABP2KOS16R2001.fastq.gz mv BM139Pol1degradedS17R2001.fastq.gz CTCFABP1KOS17R2001.fastq.gz mv BM140Pol2degradedS18R2001.fastq.gz CTCFABP2KOS18R2001.fastq.gz

Then we can normalize based on read #

```shell

!/bin/bash

SBATCH -A p32171 ##--## SLURM start

SBATCH -p short

SBATCH -N 1

SBATCH --ntasks-per-node=1

SBATCH --mem=150G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu

SBATCH --mail-type=BEGIN

SBATCH --mail-type=END

SBATCH -t 3:00:00

SBATCH --job-name=downsample

SBATCH --output=outlog

SBATCH --error=errlog

cutoff=1000000 norm=2000000 highnorm=10000000 seed=100

Paths

inpath=$1 ## usage sbatch outpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/sampled/fastq/

module load seqtk/Dec17

for file in *R1_001.fastq.gz; do echo "${file}";

File names

reads=$(expr $(zcat ${file} | wc -l) / 4) R2=$(echo $file | sed 's/R1001.fastq.gz/R2001.fastq.gz/g') R1out=$(echo $file | sed 's/R1001.fastq.gz/sampledR1001.fastq/g') R2out=$(echo $R2 | sed 's/R2001.fastq.gz/sampledR2001.fastq/g')

if [[ $reads -lt $cutoff ]] ## If there is no path to seacr software, use MACS2 then

echo -e "\t Sample ${file} has too few reads: ${reads} \n"

elif [[ $reads -lt $norm ]] && [[ $reads -gt $cutoff ]] then

echo -e "\t Sample ${file} with ${reads} reads: sampled to 1,500,000 " echo -e "\t seqtk sample -s $seed ${inpath}${file} 1500000 > ${outpath}${R1out} \n"

seqtk sample -s $seed ${inpath}${file} 1500000 > ${outpath}${R1out} seqtk sample -s $seed ${inpath}${R2} 1500000 > ${outpath}${R2out}

elif [[ $reads -gt $highnorm ]] then

echo -e "\t Sample ${file} with ${reads} reads: sampled to $highnorm " echo -e "\t seqtk sample -s $seed ${inpath}${file} $highnorm > ${outpath}${R1out} \n"

seqtk sample -s $seed ${inpath}${file} $highnorm > ${outpath}${R1out} seqtk sample -s $seed ${inpath}${R2} $highnorm > ${outpath}${R2out}

else

echo -e "\t Sample ${file} with ${reads} reads: sampled to $norm " echo -e "\t seqtk sample -s $seed ${inpath}${file} $norm > ${outpath}${R1out} \n"

seqtk sample -s $seed ${inpath}${file} $norm > ${outpath}${R1out} seqtk sample -s $seed ${inpath}${R2} $norm > ${outpath}${R2out}

fi done

gzip *.fastq

```

I checked the duplication rate using the following code and found that the rates for each sample were quite high (consistent with overamplication by PCR (my fault) and oversequencing by the core (NuSeq's fault)). This publication addresses these issues and found similar duplication issues. Here is the code I used:

```R

Summarize the duplication information from the picard summary outputs.

n=3 dir.sam <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/" subdirs <- c("S1", "S2", "S3") sam.path <- paste0(dir.sam, subdirs[n],"/SAM/") hists <- list.files(sam.path, pattern = ".dupMark.txt")

dupResult = c() for(hist in hists){ dupRes = read.table(paste0(sam.path,hist), header = TRUE, fill = TRUE)

histInfo = strsplit(hist, "")[[1]] dupResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2], MappedFragNumhg38 = dupRes$READPAIRSEXAMINED[1] %>% as.character %>% as.numeric, DuplicationRate = dupRes$PERCENTDUPLICATION[1] %>% as.character %>% as.numeric * 100, EstimatedLibrarySize = dupRes$ESTIMATEDLIBRARYSIZE[1] %>% as.character %>% as.numeric) %>% mutate(UniqueFragNum = MappedFragNumhg38 * (1-DuplicationRate/100)) %>% rbind(dupResult, .) }

```

Owner

Name: Backman Lab
Login: BackmanLab
Kind: organization
Location: United States of America

Repositories: 2
Profile: https://github.com/BackmanLab

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

https://github.com/backmanlab/cut-tag

Science Score: 36.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

Cut&Tag Data Processing Guide

Intro:

!/bin/bash

SBATCH -A p32171 ##--## SLURM start

SBATCH -p genhimem

SBATCH -N 1

SBATCH --ntasks-per-node=12

SBATCH --mem-per-cpu=10G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu

SBATCH --mail-type=BEGIN

SBATCH --mail-type=END

SBATCH -t 24:00:00

SBATCH --job-name=build_genome

SBATCH --output=outlog

SBATCH --error=errlog

-- header here --

Spike in normalization

HPC Script for normalization factor generation

Must have more than 1 core enabled for this to work

Get metadata path (Must be tab delimited)

Get bam paths

Get bam list

Write out txt files

out <- c("06032024.txt", "07012024.txt", "07192024.txt")

write.table(tbl, file = paste0(dir.meta, out[n]), quote=F, sep = "\t",row.names = F, col.names = TRUE)

Call function

Peak calling

usage

example

!/bin/bash

!/bin/bash

SBATCH -A p32171 ##--## SLURM alt #SBATCH -A b1042

SBATCH -p genhimem

!/bin/bash

SBATCH -A b1042

SBATCH -p genomics

SBATCH -N 1

SBATCH --ntasks-per-node=16

SBATCH --mem-per-cpu=13G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu

SBATCH --mail-type=BEGIN

SBATCH --mail-type=END

SBATCH -t 48:00:00

SBATCH --job-name=07.19.2024

SBATCH --output=outlog

SBATCH --error=errlog

-- header here --

with MAC2

with SEACR

== R command ==

!/bin/bash

SBATCH -A p32171 ##--## SLURM start

SBATCH -p genhimem

SBATCH -N 1

SBATCH --ntasks-per-node=1

SBATCH --mem=200G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu

SBATCH --mail-type=BEGIN

SBATCH --mail-type=END

SBATCH -t 24:00:00

SBATCH --job-name=build_genome

SBATCH --output=outlog

SBATCH --error=errlog

-- header here --

Later modification

Merge input BAMS

!/bin/bash

SBATCH -A b1042 ##--## SLURM start

SBATCH -p genomics

SBATCH -N 1

SBATCH --ntasks-per-node=10

SBATCH --mem-per-cpu=10G

SBATCH --mail-user=lucascarter2025@u.northwestern.edu