parsec
Variant calling and imputation from low coverage sequencing data
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 7 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: cguyomar
- License: mit
- Language: Nextflow
- Default Branch: main
- Size: 2.63 MB
Statistics
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Introduction
PARSEC is a bioinformatics pipeline designed to genotype large populations using low coverage (typically <3X) sequencing data.
Three imputation tools are currently available:
- Stitch
- Glimpse1
- Beagle4
The pipeline is still in development. Please reach out if you need help running it or encounter bugs.

- Index bams (SAMtools)
- Prepare fixed size genomic chunks (bedtools)
- Optional: call variants from sparse data
- Impute genotypes (stitch)
- Index vcf (Tabix)
- Concatenate vcf files (bcftools)
- Sort vcf (bcftools)
Usage
Inputs
Aligned reads
PARSEC takes bam files as input. It is advised to mark duplicates and, optionally, apply base quality score recalibration (BQSR) to the bam files.
Sarek can be used to automate read alignment (by not specifying any calling tool and, if desired, adding --skip_tools baserecalibrator).
Reference panel
Depending on the imputation method used, it may be necessary to supply a set of (preferably phased) known variants, also known as a reference panel:
- Glimpse requires a reference panel supplied with --ref_panel
- Beagle can take a reference panel supplied with --ref_panel as an optional input
- Stitch does not require a reference panel, but requires a set of SNP positions, supplied as a vcf file (genotypes are not used). PARSEC can build it automatically using the calling subworkflow, but it is advised to validate it before running imputation. A good practice is to run PARSEC a first time with --skip_imputation, hard-filter the resulting variants, and then run PARSEC a second time, passing the filtered output of the first run to --sparse_variants
| Tool | VCF Reference panel supplied with --ref_panel | VCF of SNPs supplied with --sparse_variants |
|-------------|-------------------------------------------------|----------------------------------------------------------------|
| Glimpse | ✅ Mandatory | ❌ Not applicable |
| Beagle | ⚠️ Optional | ❌ Not applicable |
| Stitch | ❌ Not applicable | ✅ Recommended (hard-filtered output of a PARSEC calling run) |
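The requirements in the table above can be expressed as a small pre-flight check before launching a run. This is only a sketch: `check_inputs` is a hypothetical helper that is not part of PARSEC, and it encodes nothing more than the table (Glimpse needs a reference panel, Beagle and Stitch do not).

```shell
# Hypothetical pre-flight check (not part of PARSEC): fail early when the
# chosen imputation tool is missing the input that the table marks as mandatory.
check_inputs() {
  local tool=$1 ref_panel=$2
  case "$tool" in
    glimpse)
      if [ -z "$ref_panel" ]; then
        echo "glimpse requires a reference panel (--ref_panel)" >&2
        return 1
      fi
      ;;
    beagle4|stitch)
      # The reference panel is optional (beagle4) or not applicable (stitch).
      ;;
    *)
      echo "unknown imputation tool: $tool" >&2
      return 1
      ;;
  esac
}

if check_inputs glimpse "my_panel.vcf.gz"; then
  echo "glimpse: inputs look consistent"
fi
if ! check_inputs glimpse "" 2>/dev/null; then
  echo "glimpse without a panel is rejected"
fi
```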
PARSEC main parameters
Default Nextflow/nf-core parameters are omitted.
Define where the pipeline should find input data and save output data.
| Parameter | Description | Type | Default | Required |
|-----------|-----------|-----------|-----------|-----------|
| bam | glob for input bams | string | | True |
| ref_panel | Reference panel VCF | string | | |
| sparse_variants | VCF of variable positions used in stitch | string | | |
| imputation_tool | Imputation tool (stitch, beagle4 or glimpse) | string | stitch | True |
| fasta | Path to FASTA genome file. This parameter is mandatory if --genome is not specified. If you don't have a BWA index available this will be generated for you automatically. Combine with --save_reference to save BWA index for future runs. | string | | |
| genome_subset | bed file specifying a genomic region to analyze | string | | |
| window_size | Size of genomic windows used for parallelization | integer | 1000000 | |
| buffer_size | Length of overlap between windows. A minimal overlap is required for good imputation | integer | 100000 | |
| ngen | Stitch parameter - number of iterations | integer | 1000 | |
| npop | Stitch parameter - number of haplotypes | integer | 10 | |
| skip_imputation | Stop after calling/genotyping | boolean | | |
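To make the interplay between window_size and buffer_size concrete, here is a minimal sketch of overlapping windows in BED-style coordinates. It assumes that each window is extended by buffer_size on both sides and clipped to the chromosome bounds. The exact chunking logic used inside PARSEC (which relies on bedtools) may differ, and `make_windows`, the chromosome name and its length are made up for illustration.

```shell
# Hypothetical helper: print tab-separated "chrom start end" lines for windows
# of a fixed size, each padded by a buffer on both sides (assumed behaviour).
make_windows() {
  local chrom=$1 len=$2 window=$3 buffer=$4
  local start=0 end pstart pend
  while [ "$start" -lt "$len" ]; do
    end=$(( start + window ))
    if [ "$end" -gt "$len" ]; then end=$len; fi
    pstart=$(( start - buffer ))
    if [ "$pstart" -lt 0 ]; then pstart=0; fi
    pend=$(( end + buffer ))
    if [ "$pend" -gt "$len" ]; then pend=$len; fi
    printf '%s\t%d\t%d\n' "$chrom" "$pstart" "$pend"
    start=$end
  done
}

# A 2.5 Mb chromosome with the default window_size and buffer_size:
make_windows chr1 2500000 1000000 100000
# chr1    0        1100000
# chr1    900000   2100000
# chr1    1900000  2500000
```

With these defaults, neighbouring windows share 200 kb around each boundary, which gives the imputation tool some flanking context at the window edges.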
Example usage
Note: If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data (not yet available for PARSEC).
I have a reference panel
bash
# --imputation_tool can be set to glimpse or beagle4 here
nextflow run cguyomar/PARSEC \
-profile <docker/singularity/.../institute> \
--bam "/path/to/data/*.bam" \
--fasta genome.fa \
--ref_panel my_panel.vcf.gz \
--imputation_tool glimpse \
--outdir <OUTDIR>
I don't have a reference panel - Stitch
PARSEC will perform a rough SNP calling on your data (using bcftools mpileup) and then use the detected positions for stitch imputation.
bash
nextflow run cguyomar/PARSEC \
-profile <docker/singularity/.../institute> \
--bam "/path/to/data/*.bam" \
--fasta genome.fa \
--imputation_tool stitch \
--outdir <OUTDIR>
A preferred approach is to apply user-defined hard filtering to the mpileup calls: first run PARSEC with the --skip_imputation option, then filter the variant calling output (for instance on QUAL, or against an external validation set).
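As an illustration of the filtering step, the sketch below keeps only records whose QUAL is at least a chosen threshold. It is a simplified stand-in that works on a small uncompressed VCF using awk. The helper name `filter_by_qual`, the threshold of 30 and the demo records are all assumptions made for the example, not values recommended by PARSEC.

```shell
# Hypothetical helper: keep header lines plus records whose QUAL (column 6)
# is defined and at least the given threshold. Input is an uncompressed VCF.
filter_by_qual() {
  local min_qual=$1 vcf=$2
  awk -F'\t' -v min="$min_qual" \
    '/^#/ { print; next } $6 != "." && $6 + 0 >= min + 0' "$vcf"
}

# Tiny made-up VCF, for demonstration only.
workdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$workdir/calls.vcf"
printf 'chr1\t100\t.\tA\tG\t50\t.\t.\n' >> "$workdir/calls.vcf"
printf 'chr1\t200\t.\tC\tT\t12\t.\t.\n' >> "$workdir/calls.vcf"
printf 'chr1\t300\t.\tG\tA\t.\t.\t.\n' >> "$workdir/calls.vcf"

# Arbitrary threshold: only the record at position 100 passes.
filter_by_qual 30 "$workdir/calls.vcf" > "$workdir/filtered.vcf"
cat "$workdir/filtered.vcf"
```

On real, compressed pipeline output, a dedicated tool would normally be used instead, for example a bcftools view include expression on QUAL, and the filtered file would need to be compressed and indexed in the same way as the original before being reused.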
A PARSEC imputation run can then be launched using the filtered variants as the set of positions to impute:
bash
nextflow run cguyomar/PARSEC \
-profile <docker/singularity/.../institute> \
--bam "/path/to/data/*.bam" \
--sparse_variants /res/of/previous/run/filtered.vcf.gz \
--fasta genome.fa \
--imputation_tool stitch \
--outdir <OUTDIR>
I don't have a reference panel - Beagle
bash
nextflow run cguyomar/PARSEC \
-profile <docker/singularity/.../institute> \
--bam "/path/to/data/*.bam" \
--fasta genome.fa \
--imputation_tool beagle4 \
--outdir <OUTDIR>
Warning: Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
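As a sketch of the -params-file route, the reference panel example from above could be written as a YAML file whose keys mirror the parameter names in the table. The file name params.yaml, the docker profile and the results output directory are placeholders chosen for this example.

```shell
# Write the pipeline parameters to a YAML file (keys match the CLI parameter
# names without the leading dashes). All values are placeholders.
cat > params.yaml <<'EOF'
bam: "/path/to/data/*.bam"
fasta: "genome.fa"
ref_panel: "my_panel.vcf.gz"
imputation_tool: "glimpse"
outdir: "results"
EOF

# The run would then be launched with (not executed in this sketch):
#   nextflow run cguyomar/PARSEC -profile docker -params-file params.yaml
cat params.yaml
```

Keeping parameters in a file makes a run easier to reproduce, since the same file can be reused or versioned alongside the results.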
Credits
PARSEC was originally written in INRAE GenPhyse by Cervin Guyomar.
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
Citations
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Login: cguyomar
- Kind: user
- Repositories: 3
- Profile: https://github.com/cguyomar
Citation (CITATIONS.md)
# cguyomar/parsec: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Release event: 1
- Watch event: 1
- Delete event: 2
- Push event: 7
- Pull request event: 2
- Create event: 6
Last Year
- Release event: 1
- Watch event: 1
- Delete event: 2
- Push event: 7
- Pull request event: 2
- Create event: 6