intropipeline
Detection of Saccharomyces paradoxus DNA across Saccharomyces cerevisiae, and vice versa.
NEWS:
:rocket: v1.1, with several improvements in stability, speed and memory consumption, has been released.
intropipeline
An automated computational framework for detecting Saccharomyces paradoxus introgressions in Saccharomyces cerevisiae strains from paired-end Illumina sequencing data.
Description
v1.0 is described in Tellini et al. 2024, Nat. Ecol. Evol., and detects S.par introgressions in S.cer strains.
v1.1 contains the following additions and changes:
- minimap2 replaced bwa mem, almost halving the running time while achieving comparable results (see Heng Li 2018, Bioinformatics);
sample: ERR3010122
threads: 2
Architecture: x86_64
CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
| script | Elapsed Time (m:ss) | Maximum resident set size (GB) |
| ------------- | ------------- | ------------- |
| bwa mem + samtools (v1) | 6:21 | 1.3 |
| minimap2 + samtools (v1.1) | 3:36 | 1.3 |
- improved the reproducibility of the mapping by implementing the standard samtools workflow according to samtools' guidelines;
- improved the robustness of the mapping by appending the name of each mapped strain to a checkpoint (cps) file (`./cps/cps.txt`). Strains whose names are stored in `./cps/cps.txt` will not be mapped again;
- introduced `data.table`, `lapply` and custom functions for large-file manipulation to reduce runtime and RAM load. Example:
sample: ERR3010122
threads: 2
Architecture: x86_64
CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
| script | Elapsed Time (m:ss) | Maximum resident set size (GB) |
| ------------- | ------------- | ------------- |
| parsermarker.r (v1) | 0:17 | 0.8 |
| parsermarker.r (v1.1) | 0:06 | 0.5 |
| clrs.r (v1) | 0:49 | 1.9 |
| clrs.r (v1.1) | 0:17 | 0.7 |
- introduced the variables `nSamples` and `nThreads` inside `runner.sh`. The first controls the number of samples to run in parallel and the second the number of threads per sample. `nSamples` guarantees a constant number of samples running in parallel: as soon as one sample finishes, another starts. These variables affect the scripts `minimap2.sh` (which replaces `bwa.sh`), `bcftools_markers.sh` (which replaces `samtools_marker.sh`) and `freec.sh`;
- corrected an error that prevented the detection of the CNVs;
- added a new approach for merging markers into blocks:
In v1 the markers are (1) genotyped, (2) filtered and (3) joined as long as they are consecutive and carry the same information. These steps are unchanged in v1.1, which adds a ranking step before them.
In v1.1 the markers are (1) ranked, (2) genotyped, (3) filtered and (4) joined as long as they are consecutive in the ranking and carry the same information. v1 did not use the ranking.
Inevitably, this results in a more fragmented signal, but it provides a more realistic and faithful representation of the introgressions, reflecting regions where the genotyping was either discordant or failed.
The ranking is also what allowed the speedup of `clrs.r` (the script that generates the blocks).
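The joining rule described above can be illustrated with a small sketch. It is not the code in `clrs.r`; it only assumes a hypothetical tab-separated input (rank, chromosome, position, genotype) sorted by rank, in which markers removed by the filtering step are simply absent, so their ranks leave a gap. The function name `merge_blocks` and the genotype labels are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical input: rank <TAB> chrom <TAB> pos <TAB> genotype, sorted by rank.
# Markers dropped by filtering are absent, so their ranks leave a gap.
# Output: chrom, block start, block end, genotype, number of markers.
merge_blocks() {
  awk -F'\t' -v OFS='\t' '
    function emit() { if (n > 0) print chrom, start, end, geno, n }
    {
      # Open a new block when the rank is not consecutive, or when the
      # chromosome or the genotype changes.
      if (n == 0 || $1 != prev + 1 || $2 != chrom || $4 != geno) {
        emit()
        chrom = $2; start = $3; geno = $4; n = 0
      }
      end = $3; prev = $1; n++
    }
    END { emit() }
  '
}

# Toy example: the marker with rank 3 was filtered out.
printf '%s\t%s\t%s\t%s\n' \
  1 chrI 100 Spar \
  2 chrI 250 Spar \
  4 chrI 900 Spar \
  5 chrI 950 Scer |
  merge_blocks
```

In the toy input the missing rank 3 splits the S.par signal into two blocks (100-250 and 900-900), which is the fragmentation that the ranking makes visible.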
Download
:octocat: :
```sh
git clone --recursive https://github.com/nicolo-tellini/intropipeline.git
```
Content
:open_file_folder: :
```{bash}
.
├── rep
│   ├── Ann
│   └── Asm
├── runner.sh
├── scr
└── seq

5 directories, 1 file
```
- `rep`: repository with assemblies, annotations and pre-computed marker table,
- `runner.sh`: the script you edit and run,
- `scr`: scripts,
- `seq`: put the FASTQ files here.
Before starting
gzip -d ./rep/mrktab.gz
gzip -d ./rep/Asm/*gz
About the fastqs
Move the FASTQs inside ./seq/
Paired-end FASTQ files must be gzipped and suffixed with `.R1.fastq.gz` and `.R2.fastq.gz`.
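The naming convention can be checked before launching the pipeline. The following is a minimal sketch; `list_paired_samples` is a hypothetical helper that is not shipped with intropipeline. It prints the name of every sample that has both mates and warns about samples with a missing `.R2.fastq.gz`.

```bash
#!/usr/bin/env bash
# list_paired_samples DIR: print the sample name of every *.R1.fastq.gz in DIR
# that has a matching *.R2.fastq.gz; warn on stderr about unpaired files.
# Hypothetical helper, not part of intropipeline.
list_paired_samples() {
  local dir=$1 r1 sample
  for r1 in "$dir"/*.R1.fastq.gz; do
    [ -e "$r1" ] || continue            # the glob matched nothing
    sample=$(basename "$r1" .R1.fastq.gz)
    if [ -e "$dir/$sample.R2.fastq.gz" ]; then
      printf '%s\n' "$sample"
    else
      printf 'missing R2 for %s\n' "$sample" >&2
    fi
  done
}

list_paired_samples ./seq
```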
Default
./scr/bwa.sh uses 2 threads per sample (n.samples = 2).
./scr/samtools_markers.sh uses 1 thread per sample (n.samples = 4).
./scr/gem.sh uses 2 threads.
./scr/freec.sh uses 4 threads.
These values can be changed by editing the scripts.
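A fixed number of samples running in parallel, each with its own thread budget, amounts to a bounded job pool. The sketch below shows one way to obtain that behaviour in bash; it is not the pipeline's implementation. It assumes bash 4.3 or later for `wait -n`, and `run_pool` and `process_sample` are hypothetical names. The variables `nSamples` and `nThreads` borrow the names used in v1.1 of `runner.sh`.

```bash
#!/usr/bin/env bash
# Keep at most nSamples jobs running; start the next sample as soon as one ends.
# Requires bash >= 4.3 for `wait -n`. Function names are hypothetical.
nSamples=2   # samples processed in parallel
nThreads=2   # threads given to each sample

process_sample() {   # placeholder for a per-sample step such as mapping
  echo "processing $1 with $nThreads threads"
}

run_pool() {
  local sample running=0
  for sample in "$@"; do
    if [ "$running" -ge "$nSamples" ]; then
      wait -n                     # block until any one job finishes
      running=$((running - 1))
    fi
    process_sample "$sample" &
    running=$((running + 1))
  done
  wait                            # wait for the remaining jobs
}

run_pool s1 s2 s3 s4
```

Because the jobs run in the background, the order of the printed lines is not guaranteed.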
How to run
Edit runner.sh :page_with_curl:
```{bash}
#!/bin/bash
# user settings
# S. paradoxus reference assembly
ref2Label="CBS432" ## choose the S.par assembly that you think best fits the origin of your samples
# short labels (used to name files)
ref2="EU" ## choose a short name for S.par
# STEP 1
fastqQC="yes" ## FastQC control (required) ("yes", "no" or "-"; the last skips the step)
# STEP 2
shortReadMapping="yes" ## ("yes","no")
# STEP 3
mrkgeno="yes" ## ("yes","no")
# STEP 4
cnv="yes" ## ("yes","no")
# STEP 5
intro="yes" ## ("yes","no")
# settings' end
```
Run runner.sh :runner:
```{bash}
nohup bash runner.sh &
```
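If a run is interrupted and `runner.sh` is launched again, the strains already listed in `./cps/cps.txt` are not mapped again. The logic of such a checkpoint can be sketched as follows; `already_mapped` and `mark_mapped` are hypothetical helper names, and the pipeline's own scripts may implement the check differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a checkpoint check; the pipeline's own logic may differ.
cps_file=./cps/cps.txt

already_mapped() {   # succeeds if the strain name is a whole line of the file
  [ -f "$cps_file" ] && grep -Fxq -- "$1" "$cps_file"
}

mark_mapped() {      # append the strain name once its mapping has finished
  mkdir -p "$(dirname "$cps_file")"
  printf '%s\n' "$1" >> "$cps_file"
}
```

A mapping loop would call `already_mapped "$strain"` first and skip the strain on success, and call `mark_mapped "$strain"` only after the mapping has completed, so that a strain interrupted halfway is mapped again on the next run.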
The result
The results concerning the introgressions are stored in ./int
Ex.
An Alpechin strain:
How to interpret the result
Blue-red plots provide an overview of potentially introgressed DNA across the genome. Interpreting the results requires integrating the different data that the pipeline produces.
:exclamation: Reminder: blocks are defined as consecutive markers bearing the same genomic info (Homo S.cer, Homo S.par, Het).
How are markers distributed inside the S.par block?
A couple of possible scenarios:
Case 1: abundant markers supporting the block
:exclamation: Note: only a few of the markers in the figure above are represented in the cartoon.
Case 2: not so abundant markers supporting the block
:exclamation: Note: you should not exclude the possibility that a large event is supported by a low number of markers, as in the example.
The number of markers supporting the blocks, the marker density and the info concerning the genotype are stored in int and int/AllSegments.
Dependencies
Software
- FastQC
- minimap2
- samtools
- bcftools
- GEM v. 1.315 (beta) !! The GEM version used for the analyses is 1.759 (not available anymore).
- Control-FREEC v. 11.6; the makeGraph.R script was renamed makeplotcnv.R. A copy of all the scripts in FREEC/scripts/ is in scr. Nevertheless, Control-FREEC has to be installed
- A copy of sambamba v. 0.6.5 is provided with the pipeline (no installation required)
R libraries
- data.table
- ggplot2
- rtracklayer
- R.filesets
- GenomicRanges
- purrr
- dplyr
- R.utils
Find out more
Marker definition Methods
Citations
Please cite this paper when using intropipeline for your publications.
Ancient and recent origins of shared polymorphisms in yeast. Nicolò Tellini, Matteo De Chiara, Simone Mozzachiodi, Lorenzo Tattini, Chiara Vischioni, Elena S. Naumova, Jonas Warringer, Anders Bergström & Gianni Liti. Nature Ecology and Evolution, 2024, https://doi.org/10.1038/s41559-024-02352-5
@article{tellini2024ancient,
title={Ancient and recent origins of shared polymorphisms in yeast},
author={Tellini, Nicol{\`o} and De Chiara, Matteo and Mozzachiodi, Simone and Tattini, Lorenzo and Vischioni, Chiara and Naumova, Elena S and Warringer, Jonas and Bergstr{\"o}m, Anders and Liti, Gianni},
journal={Nature Ecology \& Evolution},
pages={1--16},
year={2024},
publisher={Nature Publishing Group UK London}
}
Release history
- v1.0 released in 2023
- v1.1 released in 2024