https://github.com/backmanlab/cut-tag
A repository to hold code for Backman Lab Cut&Tag data analysis
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
✓DOI references
Found 7 DOI reference(s) in README -
✓Academic publication links
Links to: biorxiv.org, springer.com, nature.com, science.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (8.9%) to scientific vocabulary
Repository
A repository to hold code for Backman Lab Cut&Tag data analysis
Basic Info
- Host: GitHub
- Owner: BackmanLab
- Language: Shell
- Default Branch: main
- Size: 43 KB
Statistics
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Cut&Tag Data Processing Guide
Intro:
Cleavage Under Targets & Tagmentation (CUT&Tag)is an ChIP-like profiling strategy in which primary antibodies are bound to chromatin proteins in situ after nuclei have been permeabilized. Secondary antibodies covalently bound to the cut-and-paste transposase Tn5 are then bound to the primary antibody. Activation of the transposase simultaneously digests DNA and adds high throughput sequencing adapters (referred to as ‘tagmentation’) for paired-end DNA sequencing. Cut&Tag libraries have a much higher signal-to-noise ratio than traditional ChIP libraries and need far fewer cells as well.
Below is documentation for set-up and a single script for parsing Cut&Tag data based off of the data analysis protocol presented here from the Cut&Tag protocol and paper. Additionally, there is an nf-core pipeline for Cut&Tag. Because I want more control over the processing of the data, I will write my own wrapper based on the Henikoff Lab's data analysis protocol.
Here are related papers and resources for Cut&Tag: - A 2023 review on Cut&Tag - Integrative Analysis of CUT&Tag and RNA-Seq Data Through Bioinformatics: A Unified Workflow for Enhanced Insights - CUT&RUNTools 2.0: a pipeline for single-cell and bulk-level CUT&RUN and CUT&Tag data analysis - Spatial-CUT&Tag: Spatially resolved chromatin modification profiling at the cellular level - Active Motif's Spike in Strategy - GoPeaks: histone modification peak calling for CUT&Tag - CUT&Tag recovers up to half of ENCODE ChIP-seq peaks in modifications of H3K27 - ChIPseqSpikeInFree: a package for Spike-in Free analysis - Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling - Another Henikoff tutorial on Cut&Tag Analysis - Really good resource on Bash operators
I submitted three groups of Cut&Tag biological replicates on 6.03.2024, 07.01.2024, and 7.19.2024. The later two submissions have technical replicates in them that should be combined. The sample identifiers for each are below:
6.03.2024 - Submission 1:
|Cells | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| | WT | Pol1AB | i71-i51 | BM101 | | Pol1 degraded | Pol1AB | i71-i52 | BM102 | | Pol1 degraded | K4Me3AB| i71-i53 | BM104 | | WT |K4Me3AB | i71-i54 |BM103 | | WT |Pol2AB |i72-i51 |BM105 | | Pol 1 degraded|Pol2AB |i72-i52 |BM106 | | WT |K9Me3AB |i72-i53 |BM107 | | Pol 1 degraded|K9Me3AB |i72-i54 |BM108 |
07.01.2024 - Submission 2:
|Cells Rep1 | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| |WT |Pol1AB |i71-i51 |BM109 | |Pol 1 degraded |Pol1AB |i71-i52 |BM110 | |Pol 2 degraded |Pol1AB |i71-i53 |BM111 | |WT |Pol2AB |i71-i54 |BM112 | |Pol 1 degraded |Pol2AB |i72-i51 |BM113 | |Pol 2 degraded |Pol2AB |i72-i52 |BM114 | |Pol 1 degraded |K27AcAB |i72-i53 |BM115 | |Pol 2 degraded |K27AcAB |i72-i54 |BM116 |
|Cells Rep2 | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| |WT | Pol1AB | i73-i51 | BM117 | |Pol 1 degraded | Pol1AB | i73-i52 | BM118 | |Pol 2 degraded | Pol1AB | i73-i53 | BM119 | |WT | Pol2AB | i73-i54 | BM120 | |Pol 1 degraded | Pol2AB | i74-i51 | BM121 | |Pol 2 degraded | Pol2AB | i74-i52| BM122 | |Pol 1 degraded | K27AcAB | i74-i53 | BM123 | |Pol 2 degraded | K27AcAB | i74-i54 | BM124 |
07.19.2024 - Submission 3:
|Cells Rep1 | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| |WT |Pol1AB |i71-i51 | BM125 | |Pol 1 degraded | Pol1AB | i71-i52 | BM126 | |Pol 2 degraded | Pol1AB | i71-i53 | BM127 | |WT | CTCFAB | i71-i54 | BM128 | |WT | CTCFAB | i72-i51 | BM129 | |WT | CTCFAB | i72-i52 | BM130 | |Pol 1 degraded | CTCFAB | i72-i53 | BM131 | |Pol 2 degraded | CTCFAB | i72-i54 | BM132 |
|Cells Rep2 | AB | Index | NuSeq ID | |---------------|--------|-----------|----------| | WT | K9AB | i73-i51 | BM133 | | Pol 1 degraded | K9AB | i73-i52 | BM134 | | Pol 2 degraded | K9AB | i73-i53 | BM135 | | WT | K9AB | i73-i54 | BM136 | | Pol 1 degraded | K9AB | i74-i51 | BM137 | | Pol 2 degraded | K9AB | i74-i52 | BM138 | | Pol 1 degraded | CTCFAB | i74-i53 | BM139 | | Pol 2 degraded | CTCFAB | i74-i54 | BM140 |
Cut&Tag uses Bowtie2 for alignment.
Build Bowtie indices for reference genome prior to alignment with Tophat using the following command bowtie2-build pathtoreferencegenome.fa prefixtonameindexes using default bowtie2 on quest, bowtie2/2.4.5
For mapping to rDNA repeats + human genome, use the following reference. For general mapping, use Ensembl
shell
VERSION=108
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
OR UCSC: UCSC Genome and UCSC GTF (preferred). There are 4 different GTF references to use.
```shell wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz
```
example SLURM header for Quest HPC
```shell
!/bin/bash
SBATCH -A p32171 ##--## SLURM start
SBATCH -p genhimem
SBATCH -N 1
SBATCH --ntasks-per-node=12
SBATCH --mem-per-cpu=10G
SBATCH --mail-user=lucascarter2025@u.northwestern.edu
SBATCH --mail-type=BEGIN
SBATCH --mail-type=END
SBATCH -t 24:00:00
SBATCH --job-name=build_genome
SBATCH --output=outlog
SBATCH --error=errlog
```
```shell
-- header here --
cd /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref
module load bowtie2/2.4.5
bowtie2-build /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref/hg38.fa hg38 ```
GTF file can be found at /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf and genome index can be found at /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref
Spike in normalization
I am using ChIPseqSpikeInFree to normalize the data in lieu of a spike-in. I installed the command line version at:
/home/lmc0633/executables/ChIPseqSpikeInFree using the command version of the installation detailed at the github and R/4.3.0
Need a metadata file with the following columns:
ANTIBODY GROUPANTIBODY GROUP |ID | ANTIBODY | GROUP | |---------------|--------|-----------| |WT.bam |Pol1AB | control | |Pol1.bam |Pol1AB | pol1ko | |Pol2.bam |Pol1AB | pol2ko |
Code is written in such a way that I can run all BAMs for a particular group at once. I deposited metadata at /projects/b1042/BackmanLab/Lucas/090124_CnT/metadata as individual .txt files for each submission. each submission is labelled S1, S2, or S3, with corresponding fastqs and alignment files in each respective directory. This can be run on SLURM or it can be run on an interaction session. Script should look something like below:
Shell submission script (if submitting via SLURM):
```shell
module load R/4.3.0
Rscript --vanilla --verbose spike.R
```
R script for Spike In normalization factor. Must have more than 1 core enabled for this to work:
```R
HPC Script for normalization factor generation
library("ChIPseqSpikeInFree")
Must have more than 1 core enabled for this to work
library(BiocParallel) numCores <- parallel::detectCores() register(MulticoreParam(workers = numCores-1), default = TRUE)
n=2 ## 1, 2, or 3
Get metadata path (Must be tab delimited)
dir.meta <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/metadata/" dir.s <- c("06032024.txt", "07012024.txt", "07192024.txt") meta <- paste0(dir.meta,dir.s[n])
Get bam paths
dir.bam <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/" subdirs <- c("S1", "S2", "S3") bam.path <- paste0(dir.bam, subdirs[n],"/BAM/")
Get bam list
tbl <- read.table(meta, sep= "\t", header= T) bams <- tbl$ID
Write out txt files
out <- c("06032024.txt", "07012024.txt", "07192024.txt")
write.table(tbl, file = paste0(dir.meta, out[n]), quote=F, sep = "\t",row.names = F, col.names = TRUE)
library(stringr) input <- str_trim(paste0(bam.path,bams), side = c("both", "left", "right"))
Call function
ChIPseqSpikeInFree(bamFiles = input, chromFile = "hg38", metaFile = meta, prefix = paste0ls("/projects/b1042/BackmanLab/Lucas/090124_CnT/spikein/",subdirs[n]), ncores=4)
```
We need the size of each chromosome as a text file. this can be retrieved at UCSC:
shell
wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.chrom.sizes
Once we have the scaling factors from , we want to integrate them into generation of our coverage files like so:
shell
libSize=`cat sample1.bed|wc -l`
scale=15000000/($libSize*$SF)
genomeCoverageBed -bg -scale $scale -i sample1.bed -g hg38.chromSizes > sample1.bedGraph
bedGraphToBigWig sample1.bedGraph hg38.chromSizes sample1.bw
OR
``shell
scale_factor=echo "10000 / $seqDepth" | bc -l`
echo "Scaling factor for $histName is: $scalefactor!"
bedtools genomecov -bg -scale $scalefactor -i $projPath/alignment/bed/${histName}bowtie2.fragments.bed -g $chromSize > $projPath/alignment/bedgraph/${histName}bowtie2.fragments.normalized.bedgraph
```
Peak calling
There are a few options for calling peaks on Cut&Tag data. The more leniant and traditional option is MACS2. Using MACS2 will increase the false positive rate but will capture more peaks overall. SEACR was designed with Cut&Run/TAG data (high SN) in mind. More recent and less used, GoPeaks is also specifically designed for Cut&Run/TAG data. There are several other peak callers, but for my purposes, I will use MACS2 and SEACR. Here is a short explanation on the difference between SEACR and MACS2. This HBC workshop covers calling peaks on Cut&Run/TAG data.
For SEACR, we need to retrieve two scripts and put them in a directory for use later:
```shell mkdir SEACR ## in bin or scripts wget https://github.com/FredHutch/SEACR/raw/master/SEACR1.3.sh wget https://github.com/FredHutch/SEACR/raw/master/SEACR1.3.R
SEACR usage and example:
shell
usage
bash SEACR_1.3.sh experimental bedgraph [control bedgraph | numeric threshold] ["norm" | "non"] ["relaxed" | "stringent"] output prefix
example
bash SEACR_1.3.sh target.bedgraph 0.01 non stringent output ```
For the script, we call the following: ```shell seacr=/home/lmc0633/executables/SEACR/SEACR1.3.sh bash $seacr $projPath/alignment/bedgraph/${histName}bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}seacrtop0.01.peaks
```
```shell macs2 callpeak -t $projPath/alignment/bam/${org}/${time}/${hist}/bowtie2.mapped.bam \ -c $projPath/alignment/bam/${org}/${timeControl}/${hist}/bowtie2.mapped.bam \ -g ${genome} -f BAMPE -n macs2peakq0.1 --outdir $projPath/peakCalling/MACS2/${org}/${timeControl}control/${time}/${hist}/ -q 0.1 --keep-dup all 2>${projPath}/peakCalling/MACS2/${org}/${timeControl}control/${time}/${hist}/macs2Peak_summary.txt
```
```shell mkdir -p $projPath/peakCalling macs2 callpeak -t ${projPath}/alignment/bam/${histName}rep1bowtie2.mapped.bam \ -c ${projPath}/alignment/bam/${controlName}rep1bowtie2.mapped.bam \ -g hs -f BAMPE -n macs2peakq0.1 --outdir $projPath/peakCalling/MACS2 -q 0.1 --keep-dup all 2>${projPath}/peakCalling/MACS2/macs2Peak_summary.txt
``` double array in linux
```shell
!/bin/bash
array=( "Vietnam" "Germany" "Argentina" ) array2=( "Asia" "Europe" "America" )
for i in "${!array[@]}"; do printf "%s is in %s\n" "${array[i]}" "${array2[i]}" done
```
Script to call the EU-seq alignment script (Tophat) is below
```shell
!/bin/bash
SBATCH -A p32171 ##--## SLURM alt #SBATCH -A b1042
SBATCH -p genhimem
!/bin/bash
SBATCH -A b1042
SBATCH -p genomics
SBATCH -N 1
SBATCH --ntasks-per-node=16
SBATCH --mem-per-cpu=13G
SBATCH --mail-user=lucascarter2025@u.northwestern.edu
SBATCH --mail-type=BEGIN
SBATCH --mail-type=END
SBATCH -t 48:00:00
SBATCH --job-name=07.19.2024
SBATCH --output=outlog
SBATCH --error=errlog
-- header here --
echo -e "Usage: sh $0 -f
with MAC2
echo "Processing data" for file in *R1001.fastq.gz; do echo "${file}"; bash CutTag.sh -f ${file} -w /projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S3/Backman267.19.2024/ -g /projects/b1042/BackmanLab/Lucas/090124CnT/genome/ref/hg38 -r /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf/hg38.ncbiRefSeq.gtf -c hg38.chrom.sizes -t 16 -m 2 -d ; done
with SEACR
echo "Processing data" for file in *R1001.fastq.gz; do echo "${file}"; bash CutTag.sh -f ${file} -w /projects/b1042/BackmanLab/Lucas/090124CnT/fastq/test/ -g /projects/b1042/BackmanLab/Lucas/090124CnT/genome/ref/hg38 -r /projects/b1042/BackmanLab/Lucas/090124CnT/genome/gtf/hg38.ncbiRefSeq.gtf -p /home/lmc0633/executables/SEACR/SEACR_1.3.sh -c hg38.chrom.sizes -t 16 -m 2; done
```
For the correlation plot generation script, i adapted this code for use as follows:
```R
== R command ==
library(corrplot)
list.files('/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S1/BED/', pattern = ".bin.bed")
n=1 # c(1,2,3) path <- '/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/' subdir <- c("S1", "S2", "S3") dir <- paste0(path, subdir[n], "/BED/") hists <- list.files(dir, pattern = ".bin.bed")
reprod = c() fragCount = NULL for(hist in hists){
if(is.null(fragCount)){
fragCount = read.table(paste0(dir, hist), header = FALSE)
colnames(fragCount) = c("chrom", "bin", hist)
}else{
fragCountTmp = read.table(paste0(dir, hist), header = FALSE)
colnames(fragCountTmp) = c("chrom", "bin", hist)
fragCount = full_join(fragCount, fragCountTmp, by = c("chrom", "bin"))
} }
M = cor(fragCount %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs")
corrplot(M, method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 3, rect.col = "black", rect.lwd = 3,cl.pos = "b", tl.col = "indianred4", tl.cex = 0.5, cl.cex = 0.5, addCoef.col = "black", number.digits = 2, number.cex = 0.5, col = colorRampPalette(c("midnightblue","white","darkred"))(100))
``` Script for calling just SEACR peaks. Can't use NORM setting unless I have an input control
```shell
!/bin/bash
SBATCH -A p32171 ##--## SLURM start
SBATCH -p genhimem
SBATCH -N 1
SBATCH --ntasks-per-node=1
SBATCH --mem=200G
SBATCH --mail-user=lucascarter2025@u.northwestern.edu
SBATCH --mail-type=BEGIN
SBATCH --mail-type=END
SBATCH -t 24:00:00
SBATCH --job-name=build_genome
SBATCH --output=outlog
SBATCH --error=errlog
-- header here --
seacr=/home/lmc0633/executables/SEACR/SEACR1.3.sh bgfold=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S3/BEDGRAPH pkfold=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/peaks
module load R/4.3.0 module load bedtools
bash /home/lmc0633/executables/SEACR/SEACR1.3.sh /projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S3/BEDGRAPH/BM125WTS3.bedgraph 0.01 non stringent /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/peaks/test.peaks cd ${bgfold}
echo "Processing data" for file in *.bedgraph; do echo "${file}";
P1=$(echo $file | sed 's/.bedgraph/.seacr.peaks/g')
echo "8. Calling SEACR Peaks on ${file}" echo -e "\t bash $seacr ${bgfold}/${file} 0.01 norm relaxed ${pkfold}/${P1}"
bash $seacr ${bgfold}/${file} 0.01 non stringent ${pkfold}/${P1}
done
echo -e "Peak calling complete \n"
```
Later modification
I looked into merging replicates using Coverage() in R and unfortunately, you lose all the metadate (qValue, pValue, signal) when doing so. I decided to merge my technical replicates at the FASTQ level and then call the peaks again on those. Following calling peaks, I plan to use Bedtools intersect to find the conensus peaks since it preserves the metadata and other information. Merging all technical replicates in submission 2 and in submission 3
```shell
out=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/merged/fastq/ in=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S3/Backman26_7.19.2024/
read1=BM135Pol2degradedS13R2001.fastq.gz ; read2=BM138Pol2degradedS16R2001.fastq.gz ; file=Merged.K9AB.P2KO.S2R2001.fastq.gz
Merge input BAMS
cat ${in}${read1} ${in}${read2} > ${out}${file}
```
```shell
!/bin/bash
SBATCH -A b1042 ##--## SLURM start
SBATCH -p genomics
SBATCH -N 1
SBATCH --ntasks-per-node=10
SBATCH --mem-per-cpu=10G
SBATCH --mail-user=lucascarter2025@u.northwestern.edu
SBATCH --mail-type=BEGIN
SBATCH --mail-type=END
SBATCH -t 6:00:00
SBATCH --job-name=compute matrix
SBATCH --output=outlog
SBATCH --error=errlog
module load deeptools
== linux command ==
inpath=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S2/peaks/ outpath=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S2/peaks/test/ bgpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/coverage/
bgfile=BM109S1.rpgc.bigWig pkfile=BM109S1.seacr.peaks.relaxed.bed otfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/.seacr.peaks.summitRegion.bed/g') matfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/_SEACR.mat.gz/g')
label=P1AB_WT rep=S2 cores=10
awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $inpath$pkfile > $outpath$otfile
computeMatrix reference-point -S $bgpath$bgfile \ -R $outpath$otfile \ --skipZeros -o $outpath$matfile -p $cores -a 3000 -b 3000 --referencePoint center
plotHeatmap -m $outpath$matfile -out ${outpath}${matfile}SEACRheatmap.png --sortUsing sum --startLabel "Peak Start" -\ -endLabel "Peak End" --xAxisLabel "" --regionsLabel "Peaks" --samplesLabel "${label} ${rep}"
inpath=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S2/peaks/ outpath=/projects/b1042/BackmanLab/Lucas/090124CnT/fastq/S2/peaks/test/ bgpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/coverage/
bgfile=BM110S2.rpgc.bigWig pkfile=BM110S2.seacr.peaks.relaxed.bed otfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/.seacr.peaks.summitRegion.bed/g') matfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/_SEACR.mat.gz/g')
label=P1AB_P1KO rep=S2
awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $inpath$pkfile > $outpath$otfile
computeMatrix reference-point -S $bgpath$bgfile \ -R $outpath$otfile \ --skipZeros -o $outpath$matfile -p $cores -a 3000 -b 3000 --referencePoint center
plotHeatmap -m $outpath$matfile -out ${outpath}${matfile}SEACRheatmap.png --sortUsing sum --startLabel "Peak Start" -\ -endLabel "Peak End" --xAxisLabel "" --regionsLabel "Peaks" --samplesLabel "${label} ${rep}"
```
downsampling reads
I'm downsampling my reads because some are over-sequenced and some are undersequenced. I worry this will contribute to variability between conditions, since some samples did experience overamplification. For this task, I am using seqtk
usage: seqtk sample -s
First thing to do is count the total number of reads per file ```shell
!/bin/bash
SBATCH -A b1042 ##--## SLURM start
SBATCH -p genomics
SBATCH -N 1
SBATCH --ntasks-per-node=1
SBATCH --mem=50G
SBATCH --mail-user=lucascarter2025@u.northwestern.edu
SBATCH -t 2:00:00
SBATCH --job-name=count_reads
for file in *R1_001.fastq.gz; do echo "${file}"; READS=$(expr $(zcat ${file} | wc -l) / 4) echo -e "\t file: $file has $READS reads " echo -e "\t file: $file reads: $READS " >> reads.txt done ```
S1:
|File | Reads | Name | |------------------------------------|-------------------|-----------| | file: K4AB-Pol1S4R1001.fastq.gz | reads: 50,054,872 | K4ABP1KO | | file: K4AB-WTS3R1001.fastq.gz | reads: 68,741,393 | K4ABWT | | file: K9AB-Pol1S8R1001.fastq.gz | reads: 5,858,924 | K9ABP1KO | | file: K9AB-WTS7R1001.fastq.gz | reads: 2,249,616 | K9ABWT | | file: P1AB-Pol1S2R1001.fastq.gz | reads: 21,542,758 | P1ABP1KO | | file: P1AB-WTS1R1001.fastq.gz | reads: 4,246,499 | P1ABWT | | file: P2AB-Pol1S6R1001.fastq.gz | reads: 56,878,563 | P2ABP1KO | | file: P2AB-WTS5R1001.fastq.gz | reads: 21,233,193 | P2ABWT |
S2:
|File | Reads | Name | |----------------------------------|-----------------|-----------| | file: BM109S1R1001.fastq.gz |reads: 2,831,267 | P1ABWT | | file: BM110S2R1001.fastq.gz |reads: 1,997,534 | P1ABP1KO | | file: BM111S3R1001.fastq.gz |reads: 2,761,436 | P1ABP2KO | | file: BM112S4R1001.fastq.gz |reads: 2,904,866 | P2ABWT | | file: BM113S5R1001.fastq.gz |reads: 8,911,631 | P2ABP1KO | | file: BM114S6R1001.fastq.gz |reads: 3,923,244 | P2ABP2KO | | file: BM115S7R1001.fastq.gz |reads: 11,335,882| K27ABP1KO| | file: BM116S8R1001.fastq.gz |reads: 4,809,199 | K27ABP2KO| | file: BM117S9R1001.fastq.gz |reads: 408,680 | P1ABWT | | file: BM118S10R1001.fastq.gz |reads: 3,241,433 | P1ABP1KO | | file: BM119S11R1001.fastq.gz |reads: 3,001,093 | P1ABP2KO | | file: BM120S12R1001.fastq.gz |reads: 1,265,939 | P2ABWT | | file: BM121S13R1001.fastq.gz |reads: 4,573,860 | P2ABP1KO | | file: BM122S14R1001.fastq.gz |reads: 6,784,824 | P2ABP2KO | | file: BM123S15R1001.fastq.gz |reads: 2,668,701 | K27ABP1KO| | file: BM124S16R1001.fastq.gz |reads: 2,413,773 | K27ABP2KO|
Changing names to:
BM109S1R1001.fastq.gz P1ABWTS1R1001.fastq.gz BM110S2R1001.fastq.gz P1ABP1KOS2R1001.fastq.gz BM111S3R1001.fastq.gz P1ABP2KOS3R1001.fastq.gz BM112S4R1001.fastq.gz P2ABWTS4R1001.fastq.gz BM113S5R1001.fastq.gz P2ABP1KOS5R1001.fastq.gz BM114S6R1001.fastq.gz P2ABP2KOS6R1001.fastq.gz BM115S7R1001.fastq.gz K27ABP1KOS7R1001.fastq.gz BM116S8R1001.fastq.gz K27ABP2KOS8R1001.fastq.gz BM117S9R1001.fastq.gz P1ABWTS9R1001.fastq.gz BM118S10R1001.fastq.gz P1ABP1KOS10R1001.fastq.gz BM119S11R1001.fastq.gz P1ABP2KOS11R1001.fastq.gz BM120S12R1001.fastq.gz P2ABWTS12R1001.fastq.gz BM121S13R1001.fastq.gz P2ABP1KOS13R1001.fastq.gz BM122S14R1001.fastq.gz P2ABP2KOS14R1001.fastq.gz BM123S15R1001.fastq.gz K27ABP1KOS15R1001.fastq.gz BM124S16R1001.fastq.gz K27ABP2KOS16R1001.fastq.gz
S3:
|File | Reads | Name | |----------------------------------------------------|------------------|-----------| file: BM125WTS3R1001.fastq.gz | reads: 45,823,405 | P1ABWT file: BM126Pol1degradedS4R1001.fastq.gz | reads: 34,609,056 | P1ABP1KO file: BM127Pol2degradedS5R1001.fastq.gz | reads: 28,356,451 | P1ABP2KO file: BM128WTS6R1001.fastq.gz | reads: 34,639,433 | CTCFABWT file: BM129WTS7R1001.fastq.gz | reads: 49,503,419 | CTCFABWT file: BM130WTS8R1001.fastq.gz | reads: 47,559,479 | CTCFABWT file: BM131Pol1degradedS9R1001.fastq.gz | reads: 38,030,748 | CTCFABP1KO file: BM132Pol2degradedS10R1001.fastq.gz | reads: 28,916,144 | CTCFABP2KO file: BM133WTS11R1001.fastq.gz | reads: 63,835,066 | K9ABWT file: BM134Pol1degradedS12R1001.fastq.gz | reads: 54,217,202 | K9ABP1KO file: BM135Pol2degradedS13R1001.fastq.gz | reads: 53,024,658 | K9ABP2KO file: BM136WTS14R1001.fastq.gz | reads: 49,808,186 | K9ABWT file: BM137Pol1degradedS15R1001.fastq.gz | reads: 53,189,723 | K9ABP1KO file: BM138Pol2degradedS16R1001.fastq.gz | reads: 72,659,514 | K9ABP2KO file: BM139Pol1degradedS17R1001.fastq.gz | reads: 21,882,985 | CTCFABP1KO file: BM140Pol2degradedS18R1001.fastq.gz | reads: 23,661,905 | CTCFABP2KO
Changing names to:
mv BM125WTS3R2001.fastq.gz P1ABWTS3R2001.fastq.gz mv BM126Pol1degradedS4R2001.fastq.gz P1ABP1KOS4R2001.fastq.gz mv BM127Pol2degradedS5R2001.fastq.gz P1ABP2KOS5R2001.fastq.gz mv BM128WTS6R2001.fastq.gz CTCFABWTS6R2001.fastq.gz mv BM129WTS7R2001.fastq.gz CTCFABWTS7R2001.fastq.gz mv BM130WTS8R2001.fastq.gz CTCFABWTS8R2001.fastq.gz mv BM131Pol1degradedS9R2001.fastq.gz CTCFABP1KOS9R2001.fastq.gz mv BM132Pol2degradedS10R2001.fastq.gz CTCFABP2KOS10R2001.fastq.gz mv BM133WTS11R2001.fastq.gz K9ABWTS11R2001.fastq.gz mv BM134Pol1degradedS12R2001.fastq.gz K9ABP1KOS12R2001.fastq.gz mv BM135Pol2degradedS13R2001.fastq.gz K9ABP2KOS13R2001.fastq.gz mv BM136WTS14R2001.fastq.gz K9ABWTS14R2001.fastq.gz mv BM137Pol1degradedS15R2001.fastq.gz K9ABP1KOS15R2001.fastq.gz mv BM138Pol2degradedS16R2001.fastq.gz K9ABP2KOS16R2001.fastq.gz mv BM139Pol1degradedS17R2001.fastq.gz CTCFABP1KOS17R2001.fastq.gz mv BM140Pol2degradedS18R2001.fastq.gz CTCFABP2KOS18R2001.fastq.gz
Then we can normalize based on read #
```shell
!/bin/bash
SBATCH -A p32171 ##--## SLURM start
SBATCH -p short
SBATCH -N 1
SBATCH --ntasks-per-node=1
SBATCH --mem=150G
SBATCH --mail-user=lucascarter2025@u.northwestern.edu
SBATCH --mail-type=BEGIN
SBATCH --mail-type=END
SBATCH -t 3:00:00
SBATCH --job-name=downsample
SBATCH --output=outlog
SBATCH --error=errlog
cutoff=1000000 norm=2000000 highnorm=10000000 seed=100
Paths
inpath=$1 ## usage sbatch
module load seqtk/Dec17
for file in *R1_001.fastq.gz; do echo "${file}";
File names
reads=$(expr $(zcat ${file} | wc -l) / 4) R2=$(echo $file | sed 's/R1001.fastq.gz/R2001.fastq.gz/g') R1out=$(echo $file | sed 's/R1001.fastq.gz/sampledR1001.fastq/g') R2out=$(echo $R2 | sed 's/R2001.fastq.gz/sampledR2001.fastq/g')
if [[ $reads -lt $cutoff ]] ## If there is no path to seacr software, use MACS2 then
echo -e "\t Sample ${file} has too few reads: ${reads} \n"
elif [[ $reads -lt $norm ]] && [[ $reads -gt $cutoff ]] then
echo -e "\t Sample ${file} with ${reads} reads: sampled to 1,500,000 " echo -e "\t seqtk sample -s $seed ${inpath}${file} 1500000 > ${outpath}${R1out} \n"
seqtk sample -s $seed ${inpath}${file} 1500000 > ${outpath}${R1out} seqtk sample -s $seed ${inpath}${R2} 1500000 > ${outpath}${R2out}
elif [[ $reads -gt $highnorm ]] then
echo -e "\t Sample ${file} with ${reads} reads: sampled to $highnorm " echo -e "\t seqtk sample -s $seed ${inpath}${file} $highnorm > ${outpath}${R1out} \n"
seqtk sample -s $seed ${inpath}${file} $highnorm > ${outpath}${R1out} seqtk sample -s $seed ${inpath}${R2} $highnorm > ${outpath}${R2out}
else
echo -e "\t Sample ${file} with ${reads} reads: sampled to $norm " echo -e "\t seqtk sample -s $seed ${inpath}${file} $norm > ${outpath}${R1out} \n"
seqtk sample -s $seed ${inpath}${file} $norm > ${outpath}${R1out} seqtk sample -s $seed ${inpath}${R2} $norm > ${outpath}${R2out}
fi done
gzip *.fastq
```
I checked the duplication rate using the following code and found that the rates for each sample were quite high (consistent with overamplication by PCR (my fault) and oversequencing by the core (NuSeq's fault)). This publication addresses these issues and found similar duplication issues. Here is the code I used:
```R
Summarize the duplication information from the picard summary outputs.
n=3 dir.sam <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/" subdirs <- c("S1", "S2", "S3") sam.path <- paste0(dir.sam, subdirs[n],"/SAM/") hists <- list.files(sam.path, pattern = ".dupMark.txt")
dupResult = c() for(hist in hists){ dupRes = read.table(paste0(sam.path,hist), header = TRUE, fill = TRUE)
histInfo = strsplit(hist, "")[[1]] dupResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2], MappedFragNumhg38 = dupRes$READPAIRSEXAMINED[1] %>% as.character %>% as.numeric, DuplicationRate = dupRes$PERCENTDUPLICATION[1] %>% as.character %>% as.numeric * 100, EstimatedLibrarySize = dupRes$ESTIMATEDLIBRARYSIZE[1] %>% as.character %>% as.numeric) %>% mutate(UniqueFragNum = MappedFragNumhg38 * (1-DuplicationRate/100)) %>% rbind(dupResult, .) }
```
Owner
- Name: Backman Lab
- Login: BackmanLab
- Kind: organization
- Location: United States of America
- Repositories: 2
- Profile: https://github.com/BackmanLab
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2