darevskia-pericentromere-analysis
Analysis of pericentromeric sequences of Darevskia lizards
https://github.com/nikitin-p/darevskia-pericentromere-analysis
Basic Info
- Host: GitHub
- Owner: nikitin-p
- License: mit
- Language: Nextflow
- Default Branch: master
- Size: 3.16 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
darevskia-pericentromere-analysis
The pipeline was developed using Nextflow DSL2 by Pavel Nikitin and Sviatoslav Sidorov.
Introduction
Here we describe the pipeline used for the analysis of the pericentromeric sequences of Darevskia raddei nairensis and D. valentini, the parental species of the hybrid parthenogenetic lizard D. unisexualis. In targeted sequencing data obtained from the pericentromeres of the parental species, we search for tandem repeat monomers and, based on them, predict species-specific pericentromeric DNA FISH probes that differentially stain the parental subgenomes in the hybrid karyotype.
Requirements
- Nextflow v21.10.6 (we developed the pipeline with this Nextflow version) or Nextflow v23.04.1 (the only other version with which we tested the pipeline).
- Singularity v3.8.7 or Docker v23.0.1. We developed the pipeline using these versions and additionally tested it with Singularity v3.11.3.
Usage
- Clone the repository:
bash
git clone https://github.com/nikitin-p/darevskia-pericentromere-analysis.git
- Run the pipeline as follows:
bash
nextflow run darevskia-pericentromere-analysis/main.nf \
-profile <docker/singularity/.../institute> \
[--from_fastq [--enable_magicblast --db_dir <dir>] [--enable_tarean]]
Options
Without optional arguments, the pipeline will start from pre-assembled contigs included in this repository.
- --from_fastq: Start the pipeline from raw reads. The reads in the fastq format will be downloaded automatically from SRA. By default, the FastQC analysis and read trimming will be performed, but contigs will not be assembled (unless the --enable_tarean option is also set): instead, the pre-assembled contigs will be used.
- --enable_magicblast: Assess contamination with Magic-BLAST. Warning! This step is resource-heavy and may require several days, depending on available resources. This option can be specified only with the --from_fastq option and requires the --db_dir option. Caveat! May not work with Nextflow 23.04.1 or, possibly, higher. See the db_dir channel creation in workflows/darevskia.nf.
- --db_dir <magicblast_dir>: Directory with Magic-BLAST databases (see the Input section for details). This option can be specified only with the --enable_magicblast option.
- --enable_tarean: Assemble contigs with TAREAN. Warning! This step is resource-heavy and may require several days, depending on available resources. This option can be specified only with the --from_fastq option.
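The dependencies between the options can be checked before launching a multi-day run. The following is a minimal sketch, not part of the repository: build_cmd is a hypothetical helper that only prints the nextflow command line and refuses option combinations that the descriptions above rule out. The profile name is passed through unchanged.

```shell
# build_cmd is a hypothetical helper, not part of the repository.
# Usage: build_cmd <profile> [pipeline options...]
# Prints the nextflow command line, or fails if the options break the
# dependencies described in the Options section.
build_cmd() (
    [ $# -ge 1 ] || { echo "usage: build_cmd <profile> [options]" >&2; exit 1; }
    profile=$1
    shift
    from_fastq=0 magicblast=0 tarean=0 db_dir=
    while [ $# -gt 0 ]; do
        case $1 in
            --from_fastq) from_fastq=1 ;;
            --enable_magicblast) magicblast=1 ;;
            --enable_tarean) tarean=1 ;;
            --db_dir)
                [ $# -ge 2 ] || { echo "--db_dir needs a directory" >&2; exit 1; }
                db_dir=$2
                shift ;;
            *) echo "unknown option: $1" >&2; exit 1 ;;
        esac
        shift
    done
    # --enable_magicblast and --enable_tarean are valid only with --from_fastq.
    if [ "$from_fastq" -eq 0 ] && [ $((magicblast + tarean)) -gt 0 ]; then
        echo "--enable_magicblast/--enable_tarean require --from_fastq" >&2; exit 1
    fi
    # --enable_magicblast and --db_dir must be given together.
    if [ "$magicblast" -eq 1 ] && [ -z "$db_dir" ]; then
        echo "--enable_magicblast requires --db_dir" >&2; exit 1
    fi
    if [ "$magicblast" -eq 0 ] && [ -n "$db_dir" ]; then
        echo "--db_dir requires --enable_magicblast" >&2; exit 1
    fi
    cmd="nextflow run darevskia-pericentromere-analysis/main.nf -profile $profile"
    if [ "$from_fastq" -eq 1 ]; then cmd="$cmd --from_fastq"; fi
    if [ "$magicblast" -eq 1 ]; then cmd="$cmd --enable_magicblast --db_dir $db_dir"; fi
    if [ "$tarean" -eq 1 ]; then cmd="$cmd --enable_tarean"; fi
    echo "$cmd"
)
```

For example, build_cmd singularity --from_fastq --enable_tarean prints a command that starts from raw reads and assembles contigs with TAREAN, whereas build_cmd docker --enable_tarean fails because --from_fastq is missing.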
Input
The pipeline requires no input data. By default, it will start from pre-assembled contigs. Otherwise, if the --from_fastq option is specified, the pipeline will start from raw reads that it will download.
We provide contigs that we assembled before the development of this pipeline with TAREAN and used for our analysis. In our pipeline, we also include the contig assembly step using TAREAN. However, it is switched off by default because, when run in this pipeline within a container, TAREAN does not exactly reproduce the contigs that we assembled and used.
For the contamination assessment, we used version 5 of the following Magic-BLAST databases: 16S_ribosomal_RNA, ref_viroids_rep_genomes, ref_viruses_rep_genomes, ref_prok_rep_genomes, ref_euk_rep_genomes. They can be downloaded from https://ftp.ncbi.nlm.nih.gov/blast/db/v5 using, for example, the lftp client (see man lftp). After downloading, unpack all parts of each database. The directory with the Magic-BLAST databases must have the following structure:
magicblast_dir
├── ref_viroids_rep_genomes
│ ├── ref_viroids_rep_genomes.ndb
│ ├── ...
│ ├── ref_viroids_rep_genomes.nto
│ ├── taxdb.btd
│ └── taxdb.bti
├── ref_prok_rep_genomes
│ ├── ref_prok_rep_genomes.00.nhr
│ ├── ...
│ ├── ref_prok_rep_genomes.00.nsq
│ ├── ref_prok_rep_genomes.01.nhr
│ ├── ...
│ ├── ref_prok_rep_genomes.01.nsq
│ ├── ...
│ ├── taxdb.btd
│ ├── taxdb.bti
│ └── taxonomy4blast.sqlite3
...
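Because the Magic-BLAST step may run for days, it can be worth verifying the database directory first. The sketch below is not part of the repository: check_magicblast_dir is a hypothetical helper, and it assumes, by extrapolating from the tree above, that each of the five databases sits in a subdirectory named after it, contains at least one file prefixed with the database name, and carries its own taxdb.btd and taxdb.bti.

```shell
# check_magicblast_dir is a hypothetical helper, not part of the repository.
# Usage: check_magicblast_dir <magicblast_dir>
# Prints one line per problem found and returns non-zero if there is any.
check_magicblast_dir() (
    dir=$1
    status=0
    for db in 16S_ribosomal_RNA ref_viroids_rep_genomes ref_viruses_rep_genomes \
              ref_prok_rep_genomes ref_euk_rep_genomes; do
        if [ ! -d "$dir/$db" ]; then
            echo "missing database directory: $db"
            status=1
            continue
        fi
        # At least one database file named <db>.* is expected (assumption).
        set -- "$dir/$db/$db".*
        [ -e "$1" ] || { echo "no database files: $db/$db.*"; status=1; }
        # Taxonomy files shown in the tree above (assumed for every database).
        for f in taxdb.btd taxdb.bti; do
            [ -f "$dir/$db/$f" ] || { echo "missing file: $db/$f"; status=1; }
        done
    done
    exit "$status"
)
```

A call such as check_magicblast_dir /path/to/magicblast_dir prints nothing and returns zero when the layout matches these assumptions.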
Output
The output is placed in the results folder. If run without parameters (i.e., starting from the pre-assembled contigs), the pipeline will generate the following subfolders in the results folder:
- quast - Quality assessment of contigs with QUAST, including PDF reports.
- preprocesstrf - All contigs and the top 10% highly covered contigs in the FASTA format (for Tandem Repeats Finder) and, with stats, in the TSV format (for the analysis in R).
- trf - Tandem Repeats Finder (TRF) output in its DAT format.
- preprocessr - TRF output in a tabular format: all tandem repeat monomers and tandem repeat monomers from the top 10% highly covered contigs, annotated with the source contigs.
- monomerprobe - Tables of pairwise edit distances between all tandem repeat monomers found in the top 10% highly covered contigs.
- rplots - Plots of the GC-content and sequence length distributions, in PDF, and the corresponding knitted R notebook, in HTML.
- parsesam - Tables of predicted DNA FISH probes annotated as "mapped" or "unmapped" to the contigs of the opposite species.
- bowtie2build - Bowtie2 index files for the full sets of contigs.
- bowtie2crossalign - Alignment of manually selected candidate probes to the contigs of the opposite species, in the SAM format.
- bowtie2clsatalign - Alignment of the CLsat36 sequence to contigs, in the SAM format.
- extractcontig - Contigs, in the FASTA format, to which the CLsat36 sequence aligned. The predicted DNA FISH probes originate from these contigs.
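To illustrate what "mapped" and "unmapped" mean for the SAM files in bowtie2crossalign, the sketch below labels each alignment record by its FLAG field: in the SAM format, bit 0x4 of the FLAG (column 2) marks an unmapped read. This is not the pipeline's parsesam module; sam_mapped_status is a hypothetical helper, and the actual tables may contain additional columns.

```shell
# sam_mapped_status is a hypothetical helper, not the pipeline's parsesam step.
# Usage: sam_mapped_status <file.sam>
# Prints "<query name><TAB>mapped|unmapped" for every alignment record.
sam_mapped_status() {
    awk -F '\t' '
        /^@/ { next }    # skip SAM header lines
        {
            # FLAG is column 2; bit 0x4 set means the query is unmapped.
            status = (int($2 / 4) % 2 == 1) ? "unmapped" : "mapped"
            print $1 "\t" status
        }
    ' "$1"
}
```

A probe that aligns to several places produces one output line per alignment record, so the output may need deduplication before counting probes.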
If --from_fastq is set, then, depending on additional options, the pipeline will generate the following additional subfolders:
- reads - Gzipped raw reads, in the FASTQ format.
- fastqc - FastQC reports on the raw reads, in HTML and zipped.
- trimmomatic - Trimmed forward/reverse paired/unpaired reads, in the gzipped FASTQ format.
- magicblast - Contamination assessment report produced by Magic-BLAST as a table, in the TXT format.
- parsemagicblast - Summarised top predicted contaminants from the Magic-BLAST report, in a space-delimited table, in the TXT format.
- interlacefasta - Interlaced reads prepared as input for TAREAN, in the FASTA format.
- repeatexplorer - Contigs assembled with TAREAN, in the FASTA format. Importantly, the output of the TAREAN module in the pipeline does not exactly match the pre-assembled contigs.
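As an illustration of what interlacing means for the interlacefasta output, the sketch below alternates records from a forward and a reverse FASTA file so that each forward read is immediately followed by its mate. It is not the pipeline's module: interlace_fasta is a hypothetical helper, and it assumes that both files list the pairs in the same order and that every sequence occupies a single line.

```shell
# interlace_fasta is a hypothetical helper, not the pipeline's interlacefasta step.
# Usage: interlace_fasta <forward.fasta> <reverse.fasta>
# Assumes two-line records (header + one sequence line) in matching order.
interlace_fasta() {
    awk -v mates="$2" '
        { print }                      # forward header or sequence line
        NR % 2 == 0 {                  # a forward record has just ended
            # Emit the corresponding two-line record from the reverse file.
            if ((getline header < mates) > 0) print header
            if ((getline seq < mates) > 0) print seq
        }
    ' "$1"
}
```

For instance, interlace_fasta fwd.fasta rev.fasta > interlaced.fasta writes the records in the order forward 1, reverse 1, forward 2, reverse 2, and so on; the file names here are placeholders.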
Repository structure
- bin - R scripts run in Nextflow modules.
- blast_results - BLAST matches of the D. raddei nairensis probe CLsat30radn (FileS1.txt) and the corresponding BLAST run parameters (FileS2.asn).
- clsat36/clsat36.fasta - CLsat36 sequence.
- conf - Nextflow configs.
- contigs - Pre-assembled contigs.
- contigsforprobes - Contigs with the CLsat monomers from which probes were manually selected.
- modules - Nextflow modules implementing the steps of the pipeline.
- primer/primer.fasta - DOP-PCR primer used in library preparations.
- probes - FASTA files with validated species-specific DNA FISH probes for D. raddei nairensis and D. valentini.
- result_plots - Plots produced by the pipeline for Fig. 1b-c and Fig. S3A-B.
- rmd/plotGClength_distr.Rmd - R Markdown notebook with the analysis of the GC content and length of contigs and tandem repeat monomers.
- workflows/darevskia.nf - The workflow that implements the pipeline.
- CITATIONS.md - References to the software tools and R packages used in the pipeline, with versions.
- main.nf - A wrapper workflow that runs workflows/darevskia.nf.
- nextflow.config - The main Nextflow config file that includes the config files from the conf folder.
Citations
If you use our pipeline, please cite our paper:
Nikitin, P., Sidorov, S., Liehr, T., Klimina, K., Al‐Rikabi, A., Korchagin, V., Kolomiets, O., Arakelyan, M., & Spangenberg, V. (2024). Variants of a major DNA satellite discriminate parental subgenomes in a hybrid parthenogenetic lizard Darevskia unisexualis (Darevsky, 1966). Journal of Experimental Zoology Part B: Molecular and Developmental Evolution, 1-12. https://doi.org/10.1002/jez.b.2324412
An extensive list of references for the tools used in the pipeline can be found in the CITATIONS.md file.
Owner
- Login: nikitin-p
- Kind: user
- Location: Moscow
- Repositories: 2
- Profile: https://github.com/nikitin-p
Student, Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University
Citation (CITATIONS.md)
# darevskia-pericentromere-analysis: Citations

## [Nextflow](https://doi.org/10.1038/nbt.3820) v21.10.6

> Di Tommaso P., Chatzou M., Floden E.W., Barja P.P., Palumbo E., Notredame C. (2017). Nextflow enables reproducible computational workflows. Nat Biotechnol 35:316-319. PMID: 28398311.

## [nf-core](https://doi.org/10.1038/s41587-020-0439-x)

> Ewels P., Peltzer A., Fillinger S., Patel H., Alneberg J., Wilm A., Garcia M.U., Di Tommaso P. & Nahnsen S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol 38:276-278. PMID: 32055031.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) v0.11.9

- [Trimmomatic](https://doi.org/10.1093/bioinformatics/btu170) v0.39

  > Bolger A.M., Lohse M., Usadel B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114-2120. PMID: 24695404.

- [NCBI Magic-BLAST](https://doi.org/10.1186/s12859-019-2996-x) v1.6.0

  > Boratyn G.M., Thierry-Mieg J., Thierry-Mieg D., Busby B., Madden T.L. (2019). Magic-BLAST, an accurate RNA-seq aligner for long and short reads. BMC Bioinformatics 20:405. PMID: 31345161.

- [TAREAN (TAndem REpeat ANalyzer)](https://doi.org/10.1093/nar/gkx257) v0.3.8

  > Novák P., Robledillo L.A., Koblížková A., Vrbová I., Neumann P., Macas J. (2017). TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Res. 45:e111–e111. PMID: 28402514.

- [QUAST](https://doi.org/10.1093/bioinformatics/btt086) v5.0.2

  > Gurevich A., Saveliev V., Vyahhi N., Tesler G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics 29:1072–1075. PMID: 23422339.

- [TRF (Tandem Repeats Finder)](https://doi.org/10.1093/nar/27.2.573) v4.09

  > Benson G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27:573–580. PMID: 9862982.

- [Bowtie2](https://doi.org/10.1038/nmeth.1923) v2.4.4

  > Langmead B., Salzberg S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359. PMID: 22388286.

## R packages

- [R](https://www.R-project.org/) v3.6.3

  > R Core Team (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

- [tidyr](https://cran.r-project.org/web/packages/tidyr/index.html) v1.2.0

  > Wickham H., Vaughan D., Girlich M. (2023). tidyr: Tidy Messy Data.

- [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html) v1.0.9

  > Wickham H., François R., Henry L., Müller K., Vaughan D. (2023). dplyr: A Grammar of Data Manipulation.

- [stringr](https://cran.r-project.org/web/packages/stringr/index.html) v1.4.0

  > Wickham H. (2022). stringr: Simple, Consistent Wrappers for Common String Operations.

- [utils](https://www.R-project.org/) v3.6.3

  > R Core Team (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

- [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html) v3.3.6

  > Wickham H. (2016). Ggplot2. Springer Science+Business Media, LLC, New York, NY.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://doi.org/10.1038/s41592-018-0046-7)

  > Grüning B., Dale R., Sjödin A., Chapman B.A., Rowe J., Tomkins-Tinch C.H., Valieris R., Köster J., The Bioconda Team (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15:475-476. PMID: 29967506.

- [Docker](https://www.docker.com/)

- [Singularity](https://doi.org/10.1371/journal.pone.0177459)

  > Kurtzer G.M., Sochat V., Bauer M.W. (2017). Singularity: Scientific containers for mobility of compute. PLoS One 12:e0177459. PMID: 28494014.