https://github.com/cnr-ibba/nf-treeseq
Nextflow pipeline to convert genotype data into Tree Sequences
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.3%) to scientific vocabulary
Repository
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 7
- Releases: 0
Metadata Files
README.md
nf-treeseq
A Nextflow pipeline for generating Tree Sequences from PLINK and VCF files.
Background
This pipeline is designed to infer Tree Sequences from genotype data. It is currently tailored for PLINK genotype files, where all relevant samples are contained within a single file. The pipeline converts the PLINK file into a VCF file, corrects ALT/REF alleles, and checks chromosome sizes. It then uses Beagle to impute and phase any missing data before running tsinfer to create Tree Sequences from the VCF file.
About Ancestral Alleles
tsinfer requires ancestral alleles to generate tree sequence files. Currently, the pipeline supports three methods for determining ancestral alleles:
- Using the reference genome: The REF allele in the VCF file is used as the ancestral allele.
- Using est-sfs: This method estimates the site frequency spectrum and infers ancestral alleles. It requires the presence of outgroup samples (ancestral to the rest of the data) in the dataset.
- Using compara: This method requires an additional CSV file containing the ancestral alleles.
Getting the Pipeline
You can obtain this pipeline by cloning the GitHub repository:
bash
git clone https://github.com/cnr-ibba/nf-treeseq
Alternatively, you can use the nextflow pull command:
bash
nextflow pull cnr-ibba/nf-treeseq
For more information on installing and running Nextflow pipelines, including dealing with revisions, refer to the Nextflow documentation.
Usage
While all parameters can be passed via the command line, it is recommended to use a configuration file. The configuration file should be a simple JSON file containing at least the following parameters:
json
{
  "plink_bfile": "<binary plink prefix>",
  "plink_species": "<plink species options>",
  "plink_keep": "<plink keep file>",
  "plink_geno": 0.1,
  "genome": "<genome file>"
}
Explanation of Parameters:
- plink_bfile: The binary PLINK file prefix used as the --bfile parameter.
- plink_species: Species-specific options for PLINK (e.g., --species sheep or --chr-set 26 no-xy no-mt --allow-no-sex).
- plink_keep: A TSV file with FID and IID columns indicating the samples to keep.
- plink_geno: The PLINK --geno parameter (default: 0.1), which excludes SNPs with a higher missing rate.
- genome: The genome file used by bcftools for allele normalization (setting ALT/REF alleles) and chromosome size correction.
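The plink_keep file can usually be derived from the dataset's .fam file, whose first two whitespace-separated columns are FID and IID. The sketch below shows one way to do that with awk. The function name make_keep_file, the toy file names, and the breedA/breedB family IDs are all hypothetical, and the sketch assumes a headerless two-column file (the form PLINK 1.9 reads with --keep); check what the pipeline itself expects before relying on it.

```bash
# make_keep_file FAM OUT [FID...]
# Write the FID and IID columns of a PLINK .fam file to OUT as TSV.
# With extra arguments, keep only samples whose FID matches one of them.
make_keep_file() {
    local fam="$1"
    local out="$2"
    shift 2
    # Join the remaining arguments (if any) with commas for awk.
    local fids
    fids=$(IFS=,; printf '%s' "$*")
    awk -v fids="$fids" '
        BEGIN { n = split(fids, wanted, ","); for (i = 1; i <= n; i++) keep[wanted[i]] = 1 }
        NF >= 2 && (n == 0 || ($1 in keep)) { print $1 "\t" $2 }
    ' "$fam" > "$out"
}

# Toy example: a three-sample .fam file (FID IID father mother sex phenotype).
workdir=$(mktemp -d)
printf '%s\n' 'breedA s1 0 0 1 -9' 'breedA s2 0 0 2 -9' 'breedB s3 0 0 1 -9' > "$workdir/toy.fam"
make_keep_file "$workdir/toy.fam" "$workdir/keep.tsv" breedA
cat "$workdir/keep.tsv"
```

Calling the function without any FID arguments writes every sample in the .fam file, which is a convenient starting point for manual editing.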
Specifying Ancestral Alleles
The pipeline requires ancestral alleles to generate tree sequences. At least one of the following methods must be used to infer ancestral alleles:
1. Using the Reference Genome
To use the reference genome for inferring ancestral alleles, simply set the reference_ancestor flag:
json
{
  "reference_ancestor": true
}
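Combining the required parameters with this flag, a complete parameter file might look like the one written by the snippet below. The file name config.json and every value in it (the mydata prefix, the keep.tsv and genome.fa paths, the sheep species option taken from the example above) are placeholders for illustration, not pipeline defaults.

```bash
# Write an example params file; every value below is a placeholder.
cat > config.json <<'EOF'
{
  "plink_bfile": "mydata",
  "plink_species": "--species sheep",
  "plink_keep": "keep.tsv",
  "plink_geno": 0.1,
  "genome": "genome.fa",
  "reference_ancestor": true
}
EOF
```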
2. Using est-sfs to Infer Ancestral Alleles
To infer ancestral alleles using est-sfs, enable the with_estsfs flag and specify one or more outgroup sample files (TSV format with FID and IID columns). You can provide up to three outgroup files:
json
{
  "with_estsfs": true,
  "outgroup1": "<outgroup1 samples file>",
  "outgroup2": "<outgroup2 samples file>",
  "outgroup3": "<outgroup3 samples file>"
}
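Because est-sfs needs the outgroup samples to be present in the dataset, it can be worth checking each outgroup file against the .fam file before launching a run. The following is a minimal sketch of such a check. The function name check_samples_in_fam and the toy data are hypothetical, and it assumes headerless outgroup files with FID in the first column and IID in the second.

```bash
# check_samples_in_fam FAM SAMPLES
# Print every FID/IID pair from SAMPLES that is absent from the .fam file
# and return non-zero if any pair is missing.
check_samples_in_fam() {
    awk '
        NR == FNR { seen[$1, $2] = 1; next }      # first file: the .fam
        !(($1, $2) in seen) { print "missing: " $1 "\t" $2; bad = 1 }
        END { exit bad + 0 }
    ' "$1" "$2"
}

# Toy example: one ingroup sample and one outgroup sample.
demo=$(mktemp -d)
printf '%s\n' 'breedA s1 0 0 1 -9' 'wild w1 0 0 1 -9' > "$demo/toy.fam"
printf 'wild\tw1\n' > "$demo/outgroup1.tsv"
check_samples_in_fam "$demo/toy.fam" "$demo/outgroup1.tsv" && echo "outgroup1 OK"
```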
3. Using compara to Infer Ancestral Alleles
To use compara for inferring ancestral alleles, provide a CSV file with the following format:
csv
chrom,position,alleles,anc_allele
26,209049,A/G,C
26,268822,A/G,C
26,285471,A/G,G
26,361728,G/T,G
After generating the file, specify it using the compara_ancestor parameter:
json
{
  "compara_ancestor": "<compara file>"
}
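A quick structural check of the compara file can catch formatting slips before the pipeline runs. The sketch below validates the header and the shape of each record as shown in the example above. The function name check_compara_csv is hypothetical, and the rules it enforces (exact header text, single-base biallelic alleles) are assumptions inferred from that example rather than a documented specification. It deliberately does not require anc_allele to be one of the listed alleles, since the example rows include sites where it is not.

```bash
# check_compara_csv FILE
# Verify the header and the shape of each record; return non-zero on problems.
check_compara_csv() {
    awk -F, '
        NR == 1 {
            if ($0 != "chrom,position,alleles,anc_allele") {
                print "unexpected header: " $0; bad = 1
            }
            next
        }
        NF != 4                  { print "line " NR ": expected 4 fields"; bad = 1; next }
        $2 !~ /^[0-9]+$/         { print "line " NR ": position is not an integer"; bad = 1 }
        $3 !~ /^[ACGT]\/[ACGT]$/ { print "line " NR ": alleles should look like A/G"; bad = 1 }
        $4 !~ /^[ACGT]$/         { print "line " NR ": anc_allele should be one base"; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Toy example using two of the rows shown above.
csv=$(mktemp)
printf '%s\n' 'chrom,position,alleles,anc_allele' '26,209049,A/G,C' '26,285471,A/G,G' > "$csv"
check_compara_csv "$csv" && echo "compara file looks well-formed"
```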
Additional Parameters
Additional parameters can be set in the configuration file to control the pipeline or specify the output directory. To see all available options, run:
bash
nextflow run cnr-ibba/nf-treeseq --help
For more advanced options, including hidden parameters:
bash
nextflow run cnr-ibba/nf-treeseq --help --validationShowHiddenParams
Running the Pipeline
Once your configuration file is set up, run the pipeline with:
bash
nextflow run cnr-ibba/nf-treeseq -profile <profile> -params-file <config.json>
- <profile>: The execution environment profile (e.g., docker or singularity).
- <config.json>: The configuration file you created.
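Since at least one ancestral allele method must be configured, a small wrapper can refuse to launch the pipeline when the parameter file selects none. The sketch below is one possible pre-flight check. The function names has_ancestral_method and run_treeseq are hypothetical, the default profile is an arbitrary choice, and the check is a plain-text match on the three parameter names described above rather than a real JSON parser, so unusual formatting may defeat it.

```bash
# has_ancestral_method PARAMS_JSON
# Succeeds if the file enables reference_ancestor or with_estsfs, or sets
# compara_ancestor to a non-empty string. Plain-text match, not a JSON parser.
has_ancestral_method() {
    grep -Eq '"(reference_ancestor|with_estsfs)"[[:space:]]*:[[:space:]]*true|"compara_ancestor"[[:space:]]*:[[:space:]]*"[^"]+"' "$1"
}

# run_treeseq PARAMS_JSON [PROFILE]
# Launch the pipeline only when the pre-flight check passes.
run_treeseq() {
    local params="$1"
    local profile="${2:-singularity}"
    if ! has_ancestral_method "$params"; then
        echo "No ancestral allele method set in $params" >&2
        return 1
    fi
    nextflow run cnr-ibba/nf-treeseq -profile "$profile" -params-file "$params"
}

# Example invocation (not executed here): run_treeseq config.json docker
```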
You can also override individual parameters on the command line; values given there take precedence over those in the params file:
bash
nextflow run cnr-ibba/nf-treeseq -profile singularity -params-file config.json --plink_geno 0.2
Owner
- Name: CNR-IBBA
- Login: cnr-ibba
- Kind: organization
- Location: Milan
- Website: https://www.ibba.cnr.it/
- Repositories: 25
- Profile: https://github.com/cnr-ibba
Bioinformatic Group @ CNR-IBBA
GitHub Events
Total
- Issues event: 2
- Push event: 1
- Pull request event: 2
- Create event: 2
Last Year
- Issues event: 2
- Push event: 1
- Pull request event: 2
- Create event: 2