https://github.com/camilogarciabotero/reproducing-finstrlova-etal-2022

Reproducible steps for bacteriophage K differential gene expression analysis

https://github.com/camilogarciabotero/reproducing-finstrlova-etal-2022

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (4.8%) to scientific vocabulary
Last synced: 6 months ago · JSON representation

Repository

Reproducible steps for bacteriophage K differential gene expression analysis

Basic Info
  • Host: GitHub
  • Owner: camilogarciabotero
  • Language: R
  • Default Branch: main
  • Size: 13.9 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 3 years ago · Last pushed almost 3 years ago
Metadata Files
Readme

README.md

Reproducible data analysis of Finstrlová et al. (2022)

All the libraries are loaded in the deseq_processing.R script including the tidyverse, DESeq2 and CompleHeatmap libraries.

r source("scripts/deseq_processing.R")

Genome Annotation

bash pharokka.py -i phageK-NC.fasta -o phanotate-annotation -p phanotate-phageK-NC -g phanotate -d pharokka-db/ -t 8

Mapping

Downloading data

Data was downloaded from the SRA explorer page using a given list of curl lines as in the data-links.sh script.

Trimming with Trimmomatic

Quality filtering and trimming was done using trimmomatic

bash function trimming() { trimmomatic SE -threads 8\ $1\ Trimmed-${1}\ ILLUMINACLIP:TruSeq3-SE:2:30:10\ SLIDINGWINDOW:4:15\ LEADING:22\ TRAILING:22\ MINLEN:20\ AVGQUAL:22 }

bash for i in SRR*gz; do trimming $i; done

Map alignment with STAR

``` bash STAR --runMode genomeGenerate\ --runThreadN 32\ --genomeDir $GENOMEDIR\ --genomeFastaFiles $GENOMEFASTA\ --genomeSAindexNbases 7\

STAR --runMode alignReads\ --runThreadN 32\ --genomeDir $GENOMEDIR\ --readFilesManifest $READSMANIFEST\ --readFilesCommand gunzip -c\ --readFilesPrefix $READS_PREFIX\ --outFilterMultimapScoreRange 0\ --outFilterMatchNmin 30\ --outFilterMatchNminOverLread 0.95\ --outFilterMismatchNoverLmax 0.02\ --outFilterMismatchNoverReadLmax 1\ --alignIntronMin 20\ --alignIntronMax 1\ --outSAMtype BAM SortedByCoordinate\ --limitBAMsortRAM 1096346262\ --outBAMcompression 10\ --bamRemoveDuplicatesType UniqueIdentical\ --outSAMattributes NH HI AS nM RG\

samtools index Aligned.sortedByCoord.out.bam ```

Counting expression

Counting expression for the coverage analysis

We first split the alignment by the names of the grouped reads id (joining all replicates per group) from the manifest.

bash samtools split -f %! ungrouped-mapping/grouped-mapping.out.bam

Later we generate the coverage data for each of the time points of the infection on the forward and reverse strand[^1]:

``` bash for i in {2,5,10,20,30}; do mosdepth -t 8 -f phageK-NC.fasta -R Newman-min${i} forward-newman-min-${i} -F 1812 grouped-mapping.out.bam; done

for i in {2,5,10,20,30}; do mosdepth -t 8 -f phageK-NC.fasta -R Newman-min${i} reverse-newman-min-${i} -i 16 grouped-mapping.out.bam; done ```

Counting expression for the DE analysis

We first split the alignment by the names of the ungrouped reads id from the manifest.

bash samtools split -f %! ungrouped-mapping/ungrouped-mapping.out.bam

Then using BamToCount with each of the annotation files will generate a counts per library.

bash for i in *; do bamtocounts -r phageK-NC.fasta --coords -n --header <annotation>.gff3 ${i} > ${i}.tsv; done

Coverage analysis

``` r forwardfiles <- dir_ls(path = "grouped-mapping/mosdepth-counts-forward/", regexp = "*.gz$")

reversefiles <- dir_ls(path = "grouped-mapping/mosdepth-counts-reverse/", regexp = "*.gz$")

forward <- forwardfiles |> mapdf(bedread, .id = "filename") |> as_tibble() |> mutate(filename = as.factor(filename), strand = '-')

reverse <- reversefiles |> mapdf(bedread, .id = "filename") |> as_tibble() |> mutate(filename = as.factor(filename), strand = '+')

data <- rbind(reverse, forward) |> mutate( timepoint = as.factor(strextract(filename, "\d+")) ) |> mutate(timepoint = fctrelevel(timepoint, c('2','5','10','20','30'))) |> filter(coverage != 0) |> mutate(sqrt_cov = sqrt(coverage))

data ```

# A tibble: 680,966 × 8
   filename                 chr   start   end coverage strand timepoint sqrt_cov
   <fct>                    <chr> <int> <int>    <int> <chr>  <fct>        <dbl>
 1 grouped-mapping/mosdept… NC_0…    52    56        1 +      10            1   
 2 grouped-mapping/mosdept… NC_0…    56    60        2 +      10            1.41
 3 grouped-mapping/mosdept… NC_0…    60    61        8 +      10            2.83
 4 grouped-mapping/mosdept… NC_0…    61    62       32 +      10            5.66
 5 grouped-mapping/mosdept… NC_0…    62    63       51 +      10            7.14
 6 grouped-mapping/mosdept… NC_0…    63    64       52 +      10            7.21
 7 grouped-mapping/mosdept… NC_0…    64   100       54 +      10            7.35
 8 grouped-mapping/mosdept… NC_0…   100   113       55 +      10            7.42
 9 grouped-mapping/mosdept… NC_0…   113   127       57 +      10            7.55
10 grouped-mapping/mosdept… NC_0…   127   130       56 +      10            7.48
# ℹ 680,956 more rows

r ggplot(data, aes(x = start, y = timepoint, height = coverage)) + geom_density_ridges(aes(fill = strand, color = strand), stat = "identity", scale = 5, alpha = .7, size = .1 ) + scale_y_discrete(expand = c(0, 0)) + scale_x_continuous(expand = c(0, 0), breaks = c(0, 20000, 40000, 60000, 80000, 100000, 120000, 140000)) + theme_ridges() + theme(legend.position = "bottom") + scale_fill_manual(values = c("#F4B6B7", "#BAD4B7")) + scale_color_manual(values = c("#F4B6B7", "#BAD4B7"))

r ggsave("figs/ridges-perbase.pdf", width = 9, height = 9, dpi = 200)

Differential gene expression analyses

NCBI annotation

Generate the expression data file from all file counts

r process_expression_data

function (folder_path, output_file_path) 
{
    expressionfiles <- dir_ls(path = folder_path, regexp = "*.tsv$")
    expressionData <- pivot_wider(mutate(select(dplyr::rename(mutate(as_tibble(map_df(expressionfiles, 
        read_tsv, .id = "filename")), Filename = as.factor(filename), 
        Sample = as.factor(str_extract(filename, "Newman.+(?=\\.bam)")), 
        ), Gene = `#Feature`), Sample, Counts, Gene), Gene = str_replace(Gene, 
        "cds-|UAKGERNU_", "")), names_from = Sample, values_from = Counts)
    write_tsv(expressionData, file = output_file_path)
    return(expressionData)
}

r process_expression_data(folder_path = "ungrouped-mapping/ncbi-mapping/", output_file_path = "ungrouped-mapping/ncbi-phageK-expression.tsv")

# A tibble: 233 × 19
   Gene           `Newman-control-R01` `Newman-control-R02` `Newman-control-R03`
   <chr>                         <dbl>                <dbl>                <dbl>
 1 YP_009041223.1                    0                    0                    0
 2 YP_009041224.1                    0                    0                    0
 3 YP_009041225.1                    0                    0                    0
 4 YP_009041226.1                    0                    0                    0
 5 YP_009041227.1                    0                    0                    0
 6 YP_009041228.1                    1                    0                    0
 7 YP_009041229.1                    0                    0                    0
 8 YP_009041230.1                    0                    0                    0
 9 YP_009041231.1                    1                    0                    0
10 YP_009041232.1                    0                    0                    0
# ℹ 223 more rows
# ℹ 15 more variables: `Newman-min10-R01` <dbl>, `Newman-min10-R02` <dbl>,
#   `Newman-min10-R03` <dbl>, `Newman-min2-R01` <dbl>, `Newman-min2-R02` <dbl>,
#   `Newman-min2-R03` <dbl>, `Newman-min20-R01` <dbl>,
#   `Newman-min20-R02` <dbl>, `Newman-min20-R03` <dbl>,
#   `Newman-min30-R01` <dbl>, `Newman-min30-R02` <dbl>,
#   `Newman-min30-R03` <dbl>, `Newman-min5-R01` <dbl>, …

then process all files to get the DESeq2 object

r process_dds

function (metadata_file_path, expression_file_path) 
{
    metadata <- read.table(metadata_file_path, sep = "\t", header = T, 
        row.names = 1)
    expressionData <- read.table(expression_file_path, sep = "\t", 
        header = T, row.names = 1, check.names = F)
    expressionDataOrdered <- select(expressionData, "Newman-control-R01", 
        "Newman-control-R02", "Newman-control-R03", "Newman-min2-R01", 
        "Newman-min2-R02", "Newman-min2-R03", "Newman-min5-R01", 
        "Newman-min5-R02", "Newman-min5-R03", "Newman-min10-R01", 
        "Newman-min10-R02", "Newman-min10-R03", "Newman-min20-R01", 
        "Newman-min20-R02", "Newman-min20-R03", "Newman-min30-R01", 
        "Newman-min30-R02", "Newman-min30-R03")
    all(colnames(expressionData) == rownames(metadata))
    expValues <- data.matrix(expressionDataOrdered)
    expValuesInteger <- round(expValues)
    ddsObject <- DESeqDataSetFromMatrix(countData = expValuesInteger, 
        colData = metadata, design = ~time)
    dds <- DESeq(ddsObject)
    return(dds)
}

r ncbiDds <- process_dds(metadata_file_path = "ungrouped-mapping/mapping-metadata.tsv", expression_file_path = "ungrouped-mapping/ncbi-phageK-expression.tsv")

Generate the expression matrix for the heatmap:

r get_fold_change_matrix

function (dds) 
{
    time_M2_vs_C <- lfcShrink(dds, coef = "time_M2_vs_C", type = "normal", 
        lfcThreshold = 1)
    time_M5_vs_C <- lfcShrink(dds, coef = "time_M5_vs_C", type = "normal", 
        lfcThreshold = 1)
    time_M10_vs_C <- lfcShrink(dds, coef = "time_M10_vs_C", type = "normal", 
        lfcThreshold = 1)
    time_M20_vs_C <- lfcShrink(dds, coef = "time_M20_vs_C", type = "normal", 
        lfcThreshold = 1)
    time_M30_vs_C <- lfcShrink(dds, coef = "time_M30_vs_C", type = "normal", 
        lfcThreshold = 1)
    M2 <- pivot_longer(mutate(as_tibble(time_M2_vs_C, rownames = "gene"), 
        sample = "M2"), cols = !c(gene, sample), names_to = "parameter", 
        values_to = "value")
    M5 <- pivot_longer(mutate(as_tibble(time_M5_vs_C, rownames = "gene"), 
        sample = "M5"), cols = !c(gene, sample), names_to = "parameter", 
        values_to = "value")
    M10 <- pivot_longer(mutate(as_tibble(time_M10_vs_C, rownames = "gene"), 
        sample = "M10"), cols = !c(gene, sample), names_to = "parameter", 
        values_to = "value")
    M20 <- pivot_longer(mutate(as_tibble(time_M20_vs_C, rownames = "gene"), 
        sample = "M20"), cols = !c(gene, sample), names_to = "parameter", 
        values_to = "value")
    M30 <- pivot_longer(mutate(as_tibble(time_M30_vs_C, rownames = "gene"), 
        sample = "M30"), cols = !c(gene, sample), names_to = "parameter", 
        values_to = "value")
    foldChangeData <- pivot_wider(select(filter(rbind(M2, M5, 
        M10, M20, M30), parameter == "log2FoldChange"), gene, 
        sample, value), names_from = sample, values_from = value)
    foldChangeMatrix <- as.matrix(foldChangeData[, -1])
    rownames(foldChangeMatrix) <- foldChangeData$gene
    return(foldChangeMatrix)
}

r ncbiFoldChangeMatrix <- get_fold_change_matrix(ncbiDds)

Using the ComplexHeatmap library we will create a circular heatmap:

r circular_heatmap

function (foldChangeMatrix, output_file) 
{
    circos.clear()
    hc <- hclust(dist(foldChangeMatrix), method = "complete")
    circos.par(start.degree = 90, gap.degree = 3)
    pdf(output_file, width = 10, height = 10)
    col_fun1 <- colorRamp2(c(-5, 0, 10), c("#467CBC", "#F8F7CE", 
        "#D22729"))
    lgd <- Legend(title = "Expression", col_fun = col_fun1)
    circos.heatmap(foldChangeMatrix, col = col_fun1, split = 3, 
        show.sector.labels = TRUE, rownames.side = "outside", 
        dend.side = "inside", order(as.numeric(gsub("M", "", 
            colnames(foldChangeMatrix)))), dend.callback = function(dend, 
            m, si) {
            dendsort(dend)
        }, track.height = 0.2)
    grid.draw(lgd)
    dev.off()
}

r circular_heatmap(ncbiFoldChangeMatrix, "figs/ncbi-heatmap.pdf")

quartz_off_screen 
                2 

We can also generate a linear heatmap as well:

r pdf("figs/ncbi-plain-heatmap.pdf", width = 8, height = 10.8) col_order <- c('M2', 'M5', 'M10', 'M20', 'M30') colnames(ncbiFoldChangeMatrix) <- col_order ncbiHmap <- Heatmap(ncbiFoldChangeMatrix, column_names_gp = grid::gpar(fontsize = 10), row_names_gp = grid::gpar(fontsize = 2.5), column_order = order(as.numeric(gsub("M", "", colnames(ncbiFoldChangeMatrix))))) draw(ncbiHmap) dev.off()

quartz_off_screen 
                2 

Similarly we will get the expression data in a tibble format:

r get_fold_change_data

function (dds) 
{
    time_M2_vs_C <- mutate(as_tibble(lfcShrink(dds, coef = "time_M2_vs_C", 
        type = "normal", lfcThreshold = 1), rownames = "gene"), 
        sample = "M2")
    time_M5_vs_C <- mutate(as_tibble(lfcShrink(dds, coef = "time_M5_vs_C", 
        type = "normal", lfcThreshold = 1), rownames = "gene"), 
        sample = "M5")
    time_M10_vs_C <- mutate(as_tibble(lfcShrink(dds, coef = "time_M10_vs_C", 
        type = "normal", lfcThreshold = 1), rownames = "gene"), 
        sample = "M10")
    time_M20_vs_C <- mutate(as_tibble(lfcShrink(dds, coef = "time_M20_vs_C", 
        type = "normal", lfcThreshold = 1), rownames = "gene"), 
        sample = "M20")
    time_M30_vs_C <- mutate(as_tibble(lfcShrink(dds, coef = "time_M30_vs_C", 
        type = "normal", lfcThreshold = 1), rownames = "gene"), 
        sample = "M30")
    foldChangeData <- rbind(time_M2_vs_C, time_M5_vs_C, time_M10_vs_C, 
        time_M20_vs_C, time_M30_vs_C)
    return(foldChangeData)
}

r ncbiFoldChangeData <- get_fold_change_data(ncbiDds)

So that we can make volvano plots from different timepoints:

r volcano_plot

function (foldChangeData, output_file) 
{
    phanotatefoldChangeDataLog <- foldChangeData %>% mutate(log_pValue = -log10(padj)) %>% 
        select(gene, log_pValue, log2FoldChange, sample)
    ggplot(phanotatefoldChangeDataLog, aes(x = log2FoldChange, 
        y = log_pValue, label = gene)) + geom_point() + geom_hline(yintercept = -log10(0.05), 
        linetype = "dotted", color = "red") + geom_vline(xintercept = c(-2, 
        2), linetype = "dotted", color = "red") + facet_wrap(~factor(sample, 
        levels = c("M2", "M5", "M10", "M20", "M30")), ncol = 1) + 
        annotate("text", label = "P value < 0.05", y = 2, x = 5.5, 
            size = 3, color = "red") + gghighlight(log_pValue > 
        -log10(0.05) & abs(log2FoldChange) > 2, label_key = gene) + 
        geom_text_repel(data = subset(phanotatefoldChangeDataLog, 
            log_pValue > -log10(0.05)), segment.size = 0.2, segment.color = "grey50", 
            box.padding = 0.5, min.segment.length = 0) + theme_bw() + 
        theme(axis.text.x = element_text(size = 14), axis.text.y = element_text(size = 14), 
            axis.title.x = element_text(size = 16, face = "bold"), 
            axis.title.y = element_text(size = 16, face = "bold"), 
            legend.text = element_text(size = 16), strip.background = element_blank(), 
            legend.position = "right") + scale_x_continuous(breaks = seq(-6, 
        6, 1)) + scale_y_continuous(breaks = seq(0, 6, 1)) + 
        labs(x = "Log2(Fold change)", y = "-Log10(Adjusted P value)", 
            title = "Volcano plots", subtitle = "Differential gene expression of phageK during Newman strain at different sample points", 
            caption = "")
    ggsave(output_file, width = 10, height = 12)
}

r volcano_plot(ncbiFoldChangeData, "figs/ncbi-volcanoes.png")

Phanotate analysis

r process_expression_data(folder_path = "ungrouped-mapping/phanotate-mapping/", output_file_path = "ungrouped-mapping/phanotate-expression.tsv")

# A tibble: 276 × 19
   Gene     `Newman-control-R01` `Newman-control-R02` `Newman-control-R03`
   <chr>                   <dbl>                <dbl>                <dbl>
 1 CDS_0001                    0                    0                    0
 2 CDS_0002                    0                    0                    0
 3 CDS_0003                    0                    0                    0
 4 CDS_0004                    0                    0                    0
 5 CDS_0005                    0                    0                    0
 6 CDS_0006                    0                    0                    0
 7 CDS_0007                    0                    0                    0
 8 CDS_0008                    1                    0                    0
 9 CDS_0009                    0                    0                    0
10 CDS_0010                    0                    0                    0
# ℹ 266 more rows
# ℹ 15 more variables: `Newman-min10-R01` <dbl>, `Newman-min10-R02` <dbl>,
#   `Newman-min10-R03` <dbl>, `Newman-min2-R01` <dbl>, `Newman-min2-R02` <dbl>,
#   `Newman-min2-R03` <dbl>, `Newman-min20-R01` <dbl>,
#   `Newman-min20-R02` <dbl>, `Newman-min20-R03` <dbl>,
#   `Newman-min30-R01` <dbl>, `Newman-min30-R02` <dbl>,
#   `Newman-min30-R03` <dbl>, `Newman-min5-R01` <dbl>, …

r phanotateDds <- process_dds("ungrouped-mapping/mapping-metadata.tsv", "ungrouped-mapping/phanotate-expression.tsv")

r phannotateFoldChMatrix <- get_fold_change_matrix(phanotateDds)

using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895

r circular_heatmap(phannotateFoldChMatrix, "figs/phanotate-heatmap.pdf")

quartz_off_screen 
                2 

We can also generate a linear heatmap as well:

r pdf("figs/phanotate-plain-heatmap.pdf", width = 8, height = 10.8) col_order <- c('M2', 'M5', 'M10', 'M20', 'M30') colnames(phannotateFoldChMatrix) <- col_order phanotateHmap <- Heatmap(phannotateFoldChMatrix, column_names_gp = grid::gpar(fontsize = 10), row_names_gp = grid::gpar(fontsize = 2.5), column_order = order(as.numeric(gsub("M", "", colnames(phannotateFoldChMatrix))))) draw(phanotateHmap) dev.off()

quartz_off_screen 
                2 

r phanotateFoldChangeData <- get_fold_change_data(phanotateDds)

r volcano_plot(phanotateFoldChangeData, "figs/phanotate-volcanoes.pdf")

[^1]: For applying this per strand: https://github.com/brentp/mosdepth/issues/192

Owner

  • Name: Camilo García
  • Login: camilogarciabotero
  • Kind: user
  • Location: Bogotá, Colombia
  • Company: Universidad de los Andes

Biologist interested in applying bioinformatics and DS tools to understand evolution in different organisms. Currently working on bacteriophages and epigenomics

GitHub Events

Total
Last Year

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 8
  • Total Committers: 1
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Camilo García c****2@u****o 8
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels