https://github.com/cnr-ibba/nf-treeseq

Nextflow pipeline to convert genotype data into Tree Sequences

https://github.com/cnr-ibba/nf-treeseq

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary

Keywords

nextflow treeseq tskit
Last synced: 5 months ago · JSON representation

Repository

Nextflow pipeline to convert genotype data into Tree Sequences

Basic Info
  • Host: GitHub
  • Owner: cnr-ibba
  • License: mit
  • Language: Nextflow
  • Default Branch: master
  • Homepage:
  • Size: 2 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 7
  • Releases: 0
Topics
nextflow treeseq tskit
Created almost 2 years ago · Last pushed 10 months ago
Metadata Files
Readme License

README.md

nf-treeseq

A Nextflow pipeline for generating Tree Sequences from PLINK and VCF files.

Nextflow Run with Docker Run with Singularity

Background

This pipeline is designed to infer Tree Sequences from genotype data. It is currently tailored for PLINK genotype files, where all relevant samples are contained within a single file. The pipeline converts the PLINK file into a VCF file, corrects ALT/REF alleles, and checks chromosome sizes. It then uses Beagle to impute and phase any missing data before running tsinfer to create Tree Sequences from the VCF file.

About Ancestral Alleles

tsinfer requires ancestral alleles to generate tree sequence files. Currently, the pipeline supports three methods for determining ancestral alleles:

  1. Using the reference genome: The REF allele in the VCF file is used as the ancestral allele.
  2. Using est-sfs: This method estimates the site frequency spectrum and infers ancestral alleles. It requires the presence of outgroup samples (ancestral to the rest of the data) in the dataset.
  3. Using compara: This method requires an additional CSV file containing the ancestral alleles.

Getting the Pipeline

You can obtain this pipeline by cloning the GitHub repository:

bash git clone cnr-ibba/nf-treeseq

Alternatively, you can use the nextflow pull command:

bash nextflow pull cnr-ibba/nf-treeseq

For more information on installing and running Nextflow pipelines, including dealing with revisions, refer to the Nextflow documentation.

Usage

While all parameters can be passed via the command line, it is recommended to use a configuration file. The configuration file should be a simple JSON file containing at least the following parameters:

json { "plink_bfile": "<binary plink prefix>", "plink_species": "<plink species options>", "plink_keep": "<plink keep file>", "plink_geno": 0.1, "genome": "<genome file>" }

Explanation of Parameters:

  • plink_bfile: The binary PLINK file prefix used as the --bfile parameter.
  • plink_species: Species-specific options for PLINK (e.g., --species sheep or --chr-set 26 no-xy no-mt --allow-no-sex).
  • plink_keep: A TSV file with FID and IID columns indicating the samples to keep.
  • plink_geno: The PLINK --geno parameter (default: 0.1), which excludes SNPs with a higher missing rate.
  • genome: The genome file used by bcftools for allele normalization (setting ALT/REF alleles) and chromosome size correction.

Specifying Ancestral Alleles

The pipeline requires ancestral alleles to generate tree sequences. At least one of the following methods must be used to infer ancestral alleles:

1. Using the Reference Genome

To use the reference genome for inferring ancestral alleles, simply set the reference_ancestor flag:

json { "reference_ancestor": true }

2. Using est-sfs to Infer Ancestral Alleles

To infer ancestral alleles using est-sfs, enable the with_estsfs flag and specify one or more outgroup sample files (TSV format with FID and IID columns). You can provide up to three outgroup files:

json { "with_estsfs": true, "outgroup1": "<outgroup1 samples file>", "outgroup2": "<outgroup2 samples file>", "outgroup3": "<outgroup3 samples file>" }

3. Using compara to Infer Ancestral Alleles

To use compara for inferring ancestral alleles, provide a CSV file with the following format:

csv chrom,position,alleles,anc_allele 26,209049,A/G,C 26,268822,A/G,C 26,285471,A/G,G 26,361728,G/T,G

After generating the file, specify it using the compara_ancestor parameter:

json { "compara_ancestor": "<compara file>" }

Additional Parameters

Additional parameters can be set in the configuration file to control the pipeline or specify the output directory. To see all available options, run:

bash nextflow run cnr-ibba/nf-treeseq --help

For more advanced options, including hidden parameters:

bash nextflow run cnr-ibba/nf-treeseq --help --validationShowHiddenParams

Running the Pipeline

Once your configuration file is set up, run the pipeline with:

bash nextflow run cnr-ibba/nf-treeseq -profile <profile> -params-file <config.json>

  • <profile>: The execution environment profile (e.g., docker or singularity).
  • <config.json>: The configuration file you created.

You can also override specific parameters directly in the command line:

bash nextflow run cnr-ibba/nf-treeseq -profile singularity -params-file config.json --plink_geno 0.2

Owner

  • Name: CNR-IBBA
  • Login: cnr-ibba
  • Kind: organization
  • Location: Milan

Bioinformatic Group @ CNR-IBBA

GitHub Events

Total
  • Issues event: 2
  • Push event: 1
  • Pull request event: 2
  • Create event: 2
Last Year
  • Issues event: 2
  • Push event: 1
  • Pull request event: 2
  • Create event: 2