qtlight

eqtl analysis pipeline using tensorqtl, saige-qtl, LIMIX and jaxQTL

https://github.com/wtsi-hgi/qtlight

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization wtsi-hgi has institutional domain (www.sanger.ac.uk)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary
Last synced: 8 months ago

Repository

eqtl analysis pipeline using tensorqtl, saige-qtl, LIMIX and jaxQTL

Basic Info
  • Host: GitHub
  • Owner: wtsi-hgi
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 30.1 MB
Statistics
  • Stars: 12
  • Watchers: 3
  • Forks: 8
  • Open Issues: 0
  • Releases: 1
Created about 4 years ago · Last pushed 8 months ago
Metadata Files
Readme Changelog License Code of conduct Citation

README.md

Introduction

QTLight is a bioinformatics best-practice analysis pipeline for eQTL analysis with TensorQTL, SAIGE-QTL and LIMIX. It takes your VCF files (or pgen/bed) alongside flat quantification data (such as bulk RNA-seq expression files, ATAC-seq quantification data or splicing quantification data) or a scRNA-seq h5ad file and performs the relevant QTL analysis.

The pipeline runs TensorQTL, LIMIX and/or jaxQTL on bulk datasets and/or SAIGE-QTL on single-cell RNA-seq datasets, and assesses the overlap of the eGenes identified by the methodologies. TensorQTL is very fast, but it uses linear regression, which may not adequately represent the underlying population structure and other covariates. LIMIX is far more computationally intensive, but it is based on linear mixed models (LMMs), to which kinship matrices can be provided, so it accounts for random effects better.
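To make the linear-regression side of that trade-off concrete, here is a minimal sketch of a single variant-gene association test. It is an illustration only, not TensorQTL's code: the function name `linear_qtl_test` and the toy dosage and expression values are assumptions, and covariates and permutations are left out.

```python
import math


def linear_qtl_test(dosages, expression):
    """Regress expression on genotype dosage; return (slope, r, t_stat)."""
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need two equal-length vectors with at least 3 samples")
    mean_g = sum(dosages) / n
    mean_y = sum(expression) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("monomorphic variant or constant phenotype")
    slope = sxy / sxx                 # effect size per alternate allele
    r = sxy / math.sqrt(sxx * syy)    # Pearson correlation
    # t statistic with n - 2 degrees of freedom.
    t_stat = r * math.sqrt((n - 2) / (1 - r * r)) if abs(r) < 1 else math.inf
    return slope, r, t_stat


# Hypothetical toy data: six donors, one variant, one gene.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, r, t_stat = linear_qtl_test(dosages, expression)
print(f"slope={slope:.3f} r={r:.3f} t={t_stat:.2f}")
```

A mixed model would add a random effect whose covariance is proportional to the kinship matrix, which this fixed-effects sketch cannot express.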

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible.

TensorQTL

SaigeQTL

LIMIX

jaxQTL

Pipeline summary

  1. Genotype preparation, filtering and subsetting (bcftools)
  2. Genotype conversion to PLINK format and filtering (PLINK2)
  3. Genotype kinship matrix calculation (PLINK2)
  4. Genotype and phenotype PC calculation and QTL mapping with varying numbers of PCs (PLINK2)
  5. LIMIX eQTL mapping (LIMIX)
  6. TensorQTL QTL mapping (TensorQTL)
  7. SAIGE-QTL mapping (SAIGE-QTL)
  8. jaxQTL mapping (jaxQTL)
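As an illustration of step 3, the sketch below computes a variance-standardized genomic relationship matrix from a small dosage table. It shows the general technique only; whether it matches the exact PLINK2 options the pipeline uses is an assumption, and the function name `kinship_matrix` and the toy dosages are hypothetical.

```python
import math


def kinship_matrix(variants):
    """variants: list of per-variant dosage lists (one dosage per sample, 0..2)."""
    if not variants:
        raise ValueError("no variants supplied")
    n = len(variants[0])
    kinship = [[0.0] * n for _ in range(n)]
    used = 0
    for dosages in variants:
        if len(dosages) != n:
            raise ValueError("every variant needs one dosage per sample")
        freq = sum(dosages) / (2 * n)      # alternate allele frequency
        variance = 2 * freq * (1 - freq)   # expected dosage variance
        if variance <= 0:
            continue                       # skip monomorphic variants
        scale = math.sqrt(variance)
        z = [(d - 2 * freq) / scale for d in dosages]
        for i in range(n):
            for j in range(n):
                kinship[i][j] += z[i] * z[j]
        used += 1
    if used == 0:
        raise ValueError("all variants were monomorphic")
    # Average the per-variant outer products.
    return [[value / used for value in row] for row in kinship]


# Hypothetical toy data: 3 variants x 4 samples; samples 0 and 1 are identical.
variants = [
    [0, 0, 1, 2],
    [1, 1, 0, 2],
    [2, 2, 1, 0],
]
kinship = kinship_matrix(variants)
for row in kinship:
    print(" ".join(f"{v:6.3f}" for v in row))
```

Identical samples end up with an off-diagonal entry equal to their diagonal entries, which is the signal a mixed model uses to account for relatedness.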

Quick Start

  1. Install Nextflow (>=21.04.0)

  2. Install either Docker or Singularity

  3. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run /path/to/cloned/QTLight -profile test_bulk,<docker/singularity/institute>

  4. Prepare the input.nf parameters file:

    ```console
    params {
        method = 'single_cell'
        // Options: 'single_cell' or 'bulk'
        //   - If 'single_cell': phenotype_file must be a .h5ad file (AnnData object)
        //   - If 'bulk': phenotype_file should point to raw count matrices (e.g., STAR/featureCounts outputs)

        input_vcf = false
        // Optional if using preprocessed genotypes.
        // Leave as false or empty if providing one of:
        //   - params.genotypes.preprocessed_pgen_file
        //   - params.genotypes.preprocessed_bed_file
        //   - params.genotypes.preprocessed_bgen_file
    
        genotype_phenotype_mapping_file = '/path/to/geno_pheno_mapping.tsv'
        // Required. TSV file with:
        //   [Genotype_ID    Phenotype_ID    Sample_Category]
        // - Genotype_ID: must match PLINK IID (in .psam/.fam/.pvar)
        // - Phenotype_ID: must match sample ID in h5ad `.obs`
        // - Sample_Category: optional grouping label (e.g., 'default', 'stimA')
    
        annotation_file = '/path/to/annotation.gtf'
        // Required. Gene annotation in GTF format OR custom 4-column TSV:
        //   [feature_id  start  end  chromosome]
        // The coordinate used (TSS vs midpoint) is controlled by `position`
    
        phenotype_file = '/path/to/input_expression.h5ad'
        // For 'single_cell': must be an .h5ad file with raw or normalized counts
        // For 'bulk': a gene expression matrix (TSV)
    
        aggregation_columns = 'cell_type'
        // Comma-separated column(s) in `.obs` used for pseudobulk aggregation
        // E.g., 'cell_type', 'Azimuth:predicted.celltype.l2'
    
        aggregation_subentry = ''
        // Optional. If provided, restricts analysis to these sublevels within aggregation_columns
        // E.g., 'Mono,B,Platelet'
    
        aggregation_method = 'dMean,dSum'
        // Aggregation methods to apply: dMean = average expression, dSum = summed counts
        // Can provide both, comma-separated
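        // Worked example (illustrative numbers, not pipeline output): for one donor and
        // cell type whose three cells have values 1, 3 and 8 for a gene,
        // dSum = 1 + 3 + 8 = 12 and dMean = 12 / 3 = 4.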
    
        split_aggregation_adata = true
        // Whether to split .h5ad by Sample_Category before aggregating
    
        gt_id_column = 'Vacutainer ID'
        // Column in `.obs` with the **donor/genotype ID**.
        // Must match the VCF/PLINK ID or the `RNA` column in the genotype–phenotype mapping file.
    
        sample_column = 'pheno_id'
        // Column in `.obs` with the **sample/library ID**.
        // Distinguishes multiple measurements from the same donor.
        // Can be the same as `gt_id_column` if each sample maps to one donor.
    
        norm_method = 'NONE'
        // Normalisation strategy for bulk datasets: DESEQ | TMM | NONE
    
        dMean_norm_method = 'cp10k'
        // Normalization method to apply before dMean aggregation.
        // Options:
        //   - 'cp10k'         : Total-count normalize to 10,000 UMIs/cell, then log1p
        //   - 'pf_log1p_pf'   : Pseudofactor normalization → log1p → pseudofactor again
        //   - 'NONE'          : No normalization; original file passed through unchanged
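        //   Worked example for 'cp10k' (illustrative numbers; natural-log log1p assumed):
        //   a cell with 2,000 total UMIs and 4 UMIs for a gene gives
        //   4 / 2000 * 10000 = 20, then log1p(20) = ln(21) ≈ 3.04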
    
        //
        // Notes:
        // - Raw count matrix is expected to be in `adata.X` or `adata.layers['counts']`
        // - If not present, the pipeline assumes `adata.X` is raw and warns the user
    
        filter_method = 'None'
        // Gene filtering strategy before PCA/QTL: HVG | filterByExpr | None
    
        inverse_normal_transform = 'FALSE'
        // Whether to apply inverse normal transform post-normalisation
    
        windowSize = 500000
        // Window size (+/- bp) around gene TSS or midpoint for cis-QTL
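        // Worked example (illustrative coordinates): with windowSize = 500000 and a gene
        // anchored at position 1,200,000, variants from 700,000 to 1,700,000 fall in the cis window.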
    
        percent_of_population_expressed = 0.05
        // Minimum fraction of individuals in which gene must be expressed
    
        n_min_cells = '5'
        // Minimum cells per individual per celltype to include in QTL
    
        n_min_individ = '25'
        // Minimum individuals with valid expression to include gene
    
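        maf = 0.01
        // Minor allele frequency threshold for variant filtering
        hwe = 0.000001
        // Hardy-Weinberg equilibrium p-value threshold for variant filtering
        numberOfPermutations = 1000
        // Number of permutations used in QTL mapping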
    
        covariates {
            nr_phenotype_pcs = '2,4' 
            // Comma-separated values. Each entry defines how many phenotype PCs to use per model.
    
            nr_genotype_pcs = 4 
            // Number of genotype PCs to include in the model for population structure correction.
    
            genotype_pc_filters = '--indep-pairwise 50 5 0.2'
            // PLINK2 parameters used to calculate genotype PCs if not provided.
    
            genotype_pcs_file = ''
            // Optional. Path to precomputed genotype PCs (TSV)
            // Format: rows = PC names, columns = sample IDs (must match .psam IIDs)
            // Ensure it includes at least `nr_genotype_pcs` components.
    
            extra_covariates_file = ''
            // Optional. Path to a TSV file with additional covariates (numeric only!)
            // These will be added to the model along with PCs.
            //
            // Format:
            //     covariate   S1   S2   S3 ...
            //     Age         35   40   29
            //     BMI         22   27   24
            //
            // - First column: covariate names
            // - First row: header with sample IDs (must match genotype IIDs)
            // - All values must be strictly numeric (no categories, booleans, or NA)
            // - Missing values are not allowed — impute or remove samples upstream.
        }
    
        genotypes {
            subset_genotypes_to_available = false
            // If true: subset genotype data to only individuals found in expression data
            // (useful for large genotype datasets)
    
            use_gt_dosage = true
            // If true: use genotype dosages (DS field in VCF or PGEN format)
            // If false: use hard-called genotypes (GT field from VCF or PLINK BED)
    
            preprocessed_pgen_file = '/path/to/pgen_dir/'
            // Path to directory containing a PLINK2 dataset: .pgen, .psam, .pvar
            // This should be a clean folder with only one PLINK2 trio.
    
            preprocessed_bed_file = ''
            // Optional: path to PLINK1 dataset (BED format)
            // Folder should contain matching .bed, .bim, .fam
    
            preprocessed_bgen_file = ''
            // Optional: path to BGEN file (for LIMIX only)
            // Must include .bgen, .sample, and .bgi index
        }
    }
    

    ```

    Example genotype_phenotype_mapping_file:

    | Genotype         | RNA            | Sample_Category |
    |------------------|----------------|-----------------|
    | HPSI0713i-aehn22 | MMoxLDL7159503 | M0Ctrl          |
    | HPSI0713i-aehn22 | MMoxLDL7159504 | M0oxLDL         |
    | HPSI0713i-aehn22 | MMoxLDL7159505 | M1_oxLDL        |
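Before launching a run it can help to sanity-check the mapping file. The sketch below is a hypothetical helper, not part of QTLight: it reads the first three tab-separated columns by position, groups phenotype IDs by donor, and rejects repeated phenotype IDs. Treating a repeated phenotype ID as an error and falling back to the label 'default' when the third column is empty are both assumptions.

```python
import csv
import io
from collections import defaultdict


def read_mapping(handle):
    """Group (phenotype_id, category) pairs by genotype ID from a TSV handle."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row; columns are read by position
    samples_by_donor = defaultdict(list)
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # tolerate blank lines
        if len(row) < 2:
            raise ValueError(f"line {line_no}: expected at least 2 columns")
        genotype_id = row[0].strip()
        phenotype_id = row[1].strip()
        # Assumption: an empty or missing third column means 'default'.
        category = row[2].strip() if len(row) > 2 and row[2].strip() else "default"
        if phenotype_id in seen:
            raise ValueError(f"line {line_no}: duplicate phenotype ID {phenotype_id}")
        seen.add(phenotype_id)
        samples_by_donor[genotype_id].append((phenotype_id, category))
    return dict(samples_by_donor)


# The example rows from the table above, written as tab-separated text.
example = io.StringIO(
    "Genotype\tRNA\tSample_Category\n"
    "HPSI0713i-aehn22\tMMoxLDL7159503\tM0Ctrl\n"
    "HPSI0713i-aehn22\tMMoxLDL7159504\tM0oxLDL\n"
    "HPSI0713i-aehn22\tMMoxLDL7159505\tM1_oxLDL\n"
)
mapping = read_mapping(example)
for donor, samples in mapping.items():
    print(donor, len(samples), "samples")
```

Reading columns by position keeps the check independent of the exact header spelling, which differs between the parameter comments and the example table.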

  5. Start running your own analysis!

    nextflow run /path/to/cloned/QTLight -profile sanger -resume -c input.nf

Documentation

The QTLight pipeline comes with documentation about the pipeline usage and output.

Credits

QTLight was developed by Matiss Ozols, Tobi Alegbe, Marc Jan Bonder, Hannes Ponstingl, Bradley Harris, Haerin Jang, Vivek Iyer, Nicole Soranzo.

Citations

If you use QTLight for your analysis, please cite it using the following DOI: 10.5281/zenodo.15601494

Ozols, M. et al. QTLight (Quantitative Trait Loci mapping pipeline): GitHub. https://github.com/wtsi-hgi/QTLight.

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Owner

  • Name: Wellcome Trust Sanger Institute - Human Genetics Informatics
  • Login: wtsi-hgi
  • Kind: organization
  • Email: hgi@sanger.ac.uk
  • Location: Cambridge, UK

Analysing genomic data at scale for the Human Genetics Program

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this pipeline, please cite it as below."
title: "QTLight: Quantitative Trait Loci mapping pipeline"
version: "v1.61"
date-released: 2025-05-19
repository-code: https://github.com/wtsi-hgi/QTLight
url: https://github.com/wtsi-hgi/QTLight
license: GPL-3.0
type: pipeline
doi: 10.5281/zenodo.15601494 
authors:
  - family-names: Ozols
    given-names: Matiss
    orcid: https://orcid.org/0000-0001-5663-1053
  - family-names: Cuomo
    given-names: Anna
  - family-names: Bonder
    given-names: Marc Jan
  - family-names: Ponstingl
    given-names: Hannes
  - family-names: Alegbe
    given-names: Tobi
  - family-names: Harris
    given-names: Bradley
  - family-names: Jang
    given-names: Haerin
  - family-names: Iyer
    given-names: Vivek
  - family-names: Soranzo
    given-names: Nicole
abstract: >
  QTLight is a flexible and reproducible pipeline for population-scale QTL mapping 
  using bulk and single-cell data. It integrates TensorQTL, LIMIX, and SAIGE-QTL methods 
  for eQTL and other molecular trait analysis. The pipeline supports input from VCF or PLINK 
  formats, incorporates kinship matrices and covariates, and is compatible with both flat and h5ad-formatted phenotypes.
  Built with Nextflow DSL2, QTLight is modular, scalable, and containerized for HPC or cloud execution.

GitHub Events

Total
  • Release event: 1
  • Watch event: 4
  • Delete event: 4
  • Issue comment event: 1
  • Push event: 129
  • Pull request review event: 1
  • Pull request event: 10
  • Fork event: 2
  • Create event: 8
Last Year
  • Release event: 1
  • Watch event: 4
  • Delete event: 4
  • Issue comment event: 1
  • Push event: 129
  • Pull request review event: 1
  • Pull request event: 10
  • Fork event: 2
  • Create event: 8

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 13
  • Average time to close issues: N/A
  • Average time to close pull requests: 17 days
  • Total issue authors: 0
  • Total pull request authors: 4
  • Average comments per issue: 0
  • Average comments per pull request: 0.15
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 9
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Issue authors: 0
  • Pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.11
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • Tobi1kenobi (10)
  • jsharrison94 (2)
  • nickhir (2)
  • maxozo (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

modules/nf-core/modules/fastqc/meta.yml cpan
modules/nf-core/modules/multiqc/meta.yml cpan
assets/eqtl_container/Dockerfile docker
  • ubuntu latest build
docs/container_setup/eqtl/Dockerfile docker
  • ubuntu latest build