darevskia-pericentromere-analysis

Analysis of pericentromeric sequences of Darevskia lizards

https://github.com/nikitin-p/darevskia-pericentromere-analysis

Repository

Basic Info
  • Host: GitHub
  • Owner: nikitin-p
  • License: mit
  • Language: Nextflow
  • Default Branch: master
  • Size: 3.16 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed about 2 years ago
README.md

darevskia-pericentromere-analysis

The pipeline was developed using Nextflow DSL2 by Pavel Nikitin and Sviatoslav Sidorov.

Introduction

Here we describe the pipeline used for the analysis of the pericentromeric sequences of Darevskia raddei nairensis and D. valentini, parental species of a hybrid parthenogenetic lizard D. unisexualis. In targeted sequencing data obtained from the pericentromeres of the parental species, we search for tandem repeat monomers and, based on them, predict species-specific pericentromeric DNA FISH probes to differentially stain parental subgenomes in the hybrid karyotype.

Requirements

  • Nextflow v21.10.6 (we developed the pipeline with this Nextflow version) or Nextflow v23.04.1 (the only other version with which we tested the pipeline).

  • Singularity v3.8.7 or Docker v23.0.1. We developed the pipeline using these versions and additionally tested it with Singularity v3.11.3.
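A minimal sketch of a check against the two Nextflow versions named above. The function name `is_tested_nextflow` is hypothetical, and the version string is passed in explicitly, so the sketch makes no assumption about how `nextflow` prints its version.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): report whether a given
# Nextflow version string is one of the versions listed in Requirements.
is_tested_nextflow() {
    case "$1" in
        21.10.6|23.04.1) return 0 ;;  # versions named in the README
        *)               return 1 ;;  # anything else is untested
    esac
}

# Example: pass the version that is installed locally.
if is_tested_nextflow "21.10.6"; then
    echo "tested version"
else
    echo "untested version: the pipeline was not tested with it"
fi
```

An untested version is not necessarily incompatible; the check only flags that the authors did not report testing with it.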

Usage

  1. Clone the repository:

    git clone https://github.com/nikitin-p/darevskia-pericentromere-analysis.git

  2. Run the pipeline as follows:

    nextflow run darevskia-pericentromere-analysis/main.nf \
        -profile <docker/singularity/.../institute> \
        [--from_fastq [--enable_magicblast --db_dir <dir>] [--enable_tarean]]

Options

Without optional arguments, the pipeline will start from pre-assembled contigs included in this repository.

  • --from_fastq Start the pipeline from raw reads. The reads in the FASTQ format will be downloaded automatically from SRA. By default, the FastQC analysis and read trimming will be performed, but contigs will not be assembled (unless the --enable_tarean option is also set): instead, the pre-assembled contigs will be used.

    • --enable_magicblast Assess contamination with Magic-BLAST. Warning! This step is resource-heavy and may require several days, depending on available resources. This option can be specified only with the --from_fastq option and requires the --db_dir option. Caveat! This option may not work with Nextflow 23.04.1 and, possibly, later versions; see the db_dir channel creation in workflows/darevskia.nf.

      • --db_dir <magicblast_dir> Directory with Magic-BLAST databases (see the Input section for details). This option can be specified only with the --enable_magicblast option.
    • --enable_tarean Assemble contigs with TAREAN. Warning! This step is resource-heavy and may require several days, depending on available resources. This option can be specified only with the --from_fastq option.
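The dependencies between the options above can be checked before launching a long run. The following is a minimal sketch, not part of the pipeline: `check_options` is a hypothetical helper that only encodes the rules stated in this section and does not call Nextflow.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the pipeline): validate the
# option combinations described in the Options section.
check_options() {
    local from_fastq=0 magicblast=0 tarean=0 db_dir=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --from_fastq)        from_fastq=1 ;;
            --enable_magicblast) magicblast=1 ;;
            --enable_tarean)     tarean=1 ;;
            --db_dir)
                if [ "$#" -lt 2 ]; then
                    echo "--db_dir needs a directory argument" >&2
                    return 1
                fi
                db_dir="$2"
                shift ;;
            *)
                echo "unknown option: $1" >&2
                return 1 ;;
        esac
        shift
    done
    # --enable_magicblast needs both --from_fastq and --db_dir.
    if [ "$magicblast" -eq 1 ] && [ "$from_fastq" -eq 0 ]; then
        echo "--enable_magicblast requires --from_fastq" >&2; return 1
    fi
    if [ "$magicblast" -eq 1 ] && [ -z "$db_dir" ]; then
        echo "--enable_magicblast requires --db_dir" >&2; return 1
    fi
    # --db_dir is only meaningful together with --enable_magicblast.
    if [ -n "$db_dir" ] && [ "$magicblast" -eq 0 ]; then
        echo "--db_dir requires --enable_magicblast" >&2; return 1
    fi
    # --enable_tarean needs --from_fastq.
    if [ "$tarean" -eq 1 ] && [ "$from_fastq" -eq 0 ]; then
        echo "--enable_tarean requires --from_fastq" >&2; return 1
    fi
    return 0
}

# Example: a valid combination.
check_options --from_fastq --enable_tarean && echo "options are consistent"
```

The -profile argument is deliberately left out of the check, since the available profiles depend on the configs in the conf folder.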

Input

The pipeline requires no input data. By default, it will start from pre-assembled contigs. Otherwise, if the --from_fastq option is specified, the pipeline will start from raw reads that it will download.

We provide contigs that we assembled before the development of this pipeline with TAREAN and used for our analysis. In our pipeline, we also include the contig assembly step using TAREAN. However, it is switched off by default because, when run in this pipeline within a container, TAREAN does not exactly reproduce the contigs that we assembled and used.

For the contamination assessment, we used version 5 of the following Magic-BLAST databases: 16S_ribosomal_RNA, ref_viroids_rep_genomes, ref_viruses_rep_genomes, ref_prok_rep_genomes, ref_euk_rep_genomes. They can be downloaded from https://ftp.ncbi.nlm.nih.gov/blast/db/v5 using, for example, the lftp client (see man lftp). Unpack the downloaded parts of the databases. The directory with the Magic-BLAST databases must have the following structure:

    magicblast_dir
    ├── ref_viroids_rep_genomes
    │   ├── ref_viroids_rep_genomes.ndb
    │   ├── ...
    │   ├── ref_viroids_rep_genomes.nto
    │   ├── taxdb.btd
    │   └── taxdb.bti
    ├── ref_prok_rep_genomes
    │   ├── ref_prok_rep_genomes.00.nhr
    │   ├── ...
    │   ├── ref_prok_rep_genomes.00.nsq
    │   ├── ref_prok_rep_genomes.01.nhr
    │   ├── ...
    │   ├── ref_prok_rep_genomes.01.nsq
    │   ├── ...
    │   ├── taxdb.btd
    │   ├── taxdb.bti
    │   └── taxonomy4blast.sqlite3
    ...
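Since the Magic-BLAST step may run for days, it can help to verify the database directory first. The sketch below uses a hypothetical helper, `check_magicblast_dir`, which is not part of the pipeline. It assumes that every database subfolder contains taxdb.btd and taxdb.bti, as the two subfolders listed above do; the README does not confirm this for the other three databases.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of the pipeline): confirm that a directory
# holds one subfolder per Magic-BLAST database named in the Input section.
# Assumption: each subfolder contains taxdb.btd and taxdb.bti.
check_magicblast_dir() {
    local dir="$1" db status=0
    for db in 16S_ribosomal_RNA ref_viroids_rep_genomes \
              ref_viruses_rep_genomes ref_prok_rep_genomes \
              ref_euk_rep_genomes; do
        if [ ! -d "$dir/$db" ]; then
            echo "missing database folder: $dir/$db" >&2
            status=1
        elif [ ! -f "$dir/$db/taxdb.btd" ] || [ ! -f "$dir/$db/taxdb.bti" ]; then
            echo "missing taxdb files in: $dir/$db" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_magicblast_dir magicblast_dir && echo "database directory looks complete"` reports every missing folder on standard error and prints the confirmation only if nothing is missing. The check does not inspect the database volumes themselves.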

Output

The output is placed in the results folder. If run without parameters (i.e., starting from the pre-assembled contigs), the pipeline will generate the following subfolders in the results folder:

  • quast - Quality assessment of contigs with QUAST, including PDF reports.
  • preprocesstrf - All contigs and the top 10% highly covered contigs in the FASTA format (for Tandem Repeat Finder) and, with stats, in the TSV format (for the analysis in R).
  • trf - Tandem Repeat Finder (TRF) output in its DAT format.
  • preprocessr - TRF output in a tabular format: all tandem repeat monomers and tandem repeat monomers from the top 10% highly covered contigs, annotated with the source contigs.
  • monomerprobe - Tables of pairwise edit distances between all tandem repeat monomers found in the top 10% highly covered contigs.
  • rplots - Plots of the GC-content and sequence length distributions, in PDF, and the corresponding knitted R notebook, in HTML.
  • parsesam - Tables of predicted DNA FISH probes annotated as "mapped" or "unmapped" to the contigs of the opposite species.
  • bowtie2build - Bowtie2 index files for the full sets of contigs.
  • bowtie2crossalign - Alignment of manually selected candidate probes to the contigs of the opposite species, in the SAM format.
  • bowtie2clsatalign - Alignment of the CLsat36 sequence to contigs, in the SAM format.
  • extractcontig - Contigs, in the FASTA format, to which the CLsat36 sequence aligned. The predicted DNA FISH probes originate from these contigs.
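To illustrate what a pairwise edit distance between two monomers means, here is a minimal sketch of the Levenshtein distance computed with awk. This is an assumption for illustration only: the README does not state which edit distance the monomerprobe step uses or how it is implemented, and `edit_distance` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical illustration (not the pipeline's code): Levenshtein distance
# between two sequences, using the standard two-row dynamic programme.
edit_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a); m = length(b)
        # Distance from the empty prefix of a to each prefix of b.
        for (j = 0; j <= m; j++) prev[j] = j
        for (i = 1; i <= n; i++) {
            cur[0] = i
            for (j = 1; j <= m; j++) {
                cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                d = prev[j] + 1                                   # deletion
                if (cur[j - 1] + 1 < d) d = cur[j - 1] + 1        # insertion
                if (prev[j - 1] + cost < d) d = prev[j - 1] + cost # substitution
                cur[j] = d
            }
            for (j = 0; j <= m; j++) prev[j] = cur[j]
        }
        print prev[m]
    }'
}

edit_distance ACGT AGT   # prints 1 (one deletion)
```

The sketch is meant for short sequences over the DNA alphabet; awk interprets backslash escapes in -v assignments, so arbitrary strings would need extra care.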

If --from_fastq is set, then, depending on additional options, the pipeline will generate the following additional subfolders:

  • reads - Gzipped raw reads, in the FASTQ format.
  • fastqc - FastQC reports on the raw reads, in HTML and zipped.
  • trimmomatic - Trimmed forward/reverse paired/unpaired reads, in the gzipped FASTQ format.
  • magicblast - Contamination assessment report produced by Magic-BLAST as a table, in the TXT format.
  • parsemagicblast - Summarised top predicted contaminants from the Magic-BLAST report, in a space-delimited table, in the TXT format.
  • interlacefasta - Interlaced reads prepared as input for TAREAN, in the FASTA format.
  • repeatexplorer - Contigs assembled with TAREAN, in the FASTA format. Importantly, the output of the TAREAN module in the pipeline does not exactly match the pre-assembled contigs.
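After a run, the presence of the expected subfolders can be checked with a short script. The sketch below uses a hypothetical helper, `check_results`, which is not part of the pipeline. It always checks the default subfolders listed above and accepts further subfolder names as arguments, because which of the additional subfolders appear depends on the options used.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of the pipeline): verify that the results
# folder contains the default subfolders, plus any extra ones named as
# additional arguments (e.g. reads fastqc trimmomatic).
check_results() {
    local results="$1" sub missing=0
    shift
    for sub in quast preprocesstrf trf preprocessr monomerprobe rplots \
               parsesam bowtie2build bowtie2crossalign bowtie2clsatalign \
               extractcontig "$@"; do
        if [ ! -d "$results/$sub" ]; then
            echo "missing: $results/$sub" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For example, `check_results results` covers a default run, while `check_results results magicblast parsemagicblast` additionally expects the Magic-BLAST output. The check only tests that the folders exist, not that their contents are complete.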

Repository structure

  • bin - R scripts run in Nextflow modules.
  • blast_results - BLAST matches of the D. raddei nairensis probe CLsat30radn (FileS1.txt) and the corresponding BLAST run parameters (FileS2.asn).
  • clsat36/clsat36.fasta - CLsat36 sequence.
  • conf - Nextflow configs.
  • contigs - Pre-assembled contigs.
  • contigsforprobes - Contigs with the CLsat monomers from which probes were manually selected.
  • modules - Nextflow modules implementing the steps of the pipeline.
  • primer/primer.fasta - DOP-PCR primer used in library preparations.
  • probes - FASTA files with validated species-specific DNA FISH probes for D. raddei nairensis and D. valentini.
  • result_plots - Plots produced by the pipeline for Fig. 1b-c and Fig. S3A-B.
  • rmd/plotGClength_distr.Rmd - R Markdown notebook with the analysis of the GC content and length of contigs and tandem repeat monomers.
  • workflows/darevskia.nf - The workflow that implements the pipeline.
  • CITATIONS.md - References to the software tools and R packages used in the pipeline, with versions.
  • main.nf - A wrapper workflow that runs workflows/darevskia.nf.
  • nextflow.config - The main Nextflow config file that includes the config files from the conf folder.

Citations

If you use our pipeline, please cite our paper:

Nikitin, P., Sidorov, S., Liehr, T., Klimina, K., Al‐Rikabi, A., Korchagin, V., Kolomiets, O., Arakelyan, M., & Spangenberg, V. (2024). Variants of a major DNA satellite discriminate parental subgenomes in a hybrid parthenogenetic lizard Darevskia unisexualis (Darevsky, 1966). Journal of Experimental Zoology Part B: Molecular and Developmental Evolution, 1-12. https://doi.org/10.1002/jez.b.2324412

An extensive list of references for the tools used in the pipeline can be found in the CITATIONS.md file.

Owner

  • Login: nikitin-p
  • Kind: user
  • Location: Moscow

Student, Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University

Citation (CITATIONS.md)

# darevskia-pericentromere-analysis: Citations

## [Nextflow](https://doi.org/10.1038/nbt.3820) 

v21.10.6

> Di Tommaso P., Chatzou M., Floden E.W., Barja P.P., Palumbo E., Notredame C. (2017). Nextflow enables reproducible computational workflows. Nat Biotechnol 35:316-319. PMID: 28398311.

## [nf-core](https://doi.org/10.1038/s41587-020-0439-x)

> Ewels P., Peltzer A., Fillinger S., Patel H., Alneberg J., Wilm A., Garcia M.U., Di Tommaso P. & Nahnsen S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol 38:276-278. PMID: 32055031.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) v0.11.9 

- [Trimmomatic](https://doi.org/10.1093/bioinformatics/btu170) v0.39 
  > Bolger A.M., Lohse M., Usadel B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114-2120. PMID: 24695404.

- [NCBI Magic-BLAST](https://doi.org/10.1186/s12859-019-2996-x) v1.6.0
  > Boratyn G.M., Thierry-Mieg J., Thierry-Mieg D., Busby B., Madden T.L. (2019). Magic-BLAST, an accurate RNA-seq aligner for long and short reads. BMC Bioinformatics 20:405. PMID: 31345161.

- [TAREAN (TAndem REpeat ANalyzer)](https://doi.org/10.1093/nar/gkx257) v0.3.8 
  > Novák P., Robledillo L.A., Koblížková A., Vrbová I., Neumann P., Macas J. (2017). TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Res. 45:e111–e111. PMID: 28402514.

- [QUAST](https://doi.org/10.1093/bioinformatics/btt086) v5.0.2 
  > Gurevich A., Saveliev V., Vyahhi N., Tesler G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics 29:1072–1075. PMID: 23422339.

- [TRF (Tandem Repeats Finder)](https://doi.org/10.1093/nar/27.2.573) v4.09
  > Benson G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27:573–580. PMID: 9862982.

- [Bowtie2](https://doi.org/10.1038/nmeth.1923) v2.4.4
  > Langmead B., Salzberg S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359. PMID: 22388286.

## R packages

- [R](https://www.R-project.org/) v3.6.3
  > R Core Team (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

- [tidyr](https://cran.r-project.org/web/packages/tidyr/index.html) v1.2.0
  > Wickham H., Vaughan D., Girlich M. (2023). tidyr: Tidy Messy Data.

- [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html) v1.0.9
  > Wickham H., François R., Henry L., Müller K., Vaughan D. (2023). dplyr: A Grammar of Data Manipulation.

- [stringr](https://cran.r-project.org/web/packages/stringr/index.html) v1.4.0
  > Wickham H. (2022). stringr: Simple, Consistent Wrappers for Common String Operations.

- [utils](https://www.R-project.org/) v3.6.3
  > R Core Team (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

- [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html) v3.3.6
  > Wickham H. (2016). Ggplot2. Springer Science+Business Media, LLC, New York, NY.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://doi.org/10.1038/s41592-018-0046-7)
  > Grüning B., Dale R., Sjödin A., Chapman B.A., Rowe J., Tomkins-Tinch C.H., Valieris R., Köster J., The Bioconda Team (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15:475-476. PMID: 29967506.

- [Docker](https://www.docker.com/)

- [Singularity](https://doi.org/10.1371/journal.pone.0177459)
  > Kurtzer G.M., Sochat V., Bauer M.W. (2017). Singularity: Scientific containers for mobility of compute. PLoS One 12:e0177459. PMID: 28494014.

Dependencies

modules/nf-core/modules/bowtie2/build/meta.yml cpan
modules/nf-core/modules/fastqc/meta.yml cpan