https://github.com/biodataanalysisgroup/synth4bench

A framework for generating synthetic genomics data for the evaluation of tumor-only somatic variant calling algorithms.

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 12 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary

Keywords

benchmarking bioinformatics synthetic-dataset-generation variant-calling
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: BiodataAnalysisGroup
  • License: mit
  • Language: R
  • Default Branch: main
  • Homepage:
  • Size: 242 MB
Statistics
  • Stars: 4
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Fork of sfragkoul/synth4bench
Created over 2 years ago · Last pushed 8 months ago
Metadata Files
Readme License

README.md


Abstract

Somatic variant calling algorithms are essential for detecting genomic alterations associated with cancer. However, evaluating their performance can be challenging due to the lack of high-quality ground truth datasets. To address this issue, we developed a synthetic genomics data generation framework for benchmarking tumor-only somatic variant calling algorithms. We generated synthetic datasets based on the TP53 gene using the NEAT v3.3 simulator. Subsequently, we thoroughly evaluated the performance of variant calling algorithms including GATK-Mutect2, Freebayes, VarDict, VarScan, and LoFreq on these datasets, comparing results against the ground truth produced by NEAT. Synthetic datasets provide an excellent ground truth for studying the performance and behavior of somatic variant calling algorithms, enabling researchers to evaluate and improve their accuracy for cancer genomics applications.

Motivation

Variant calling plays a critical role in identifying genetic lesions. In the case of low-frequency variants (≤10%), identification becomes more challenging due to the absence of ground truth datasets for reliable and consistent benchmarking.

Description of Framework

synth4bench schematic

Our framework addresses the challenge of variant calling, particularly for low-frequency variants (≤10%). The goal is a reliable and consistent method for identifying cancer-associated genetic lesions. The lack of ground truth datasets complicates benchmarking and evaluation. To overcome this, our framework includes:

  1. **Synthetic Data Generation:** Using the NEAT v3.3 simulator, we generate synthetic genomics data that mimics real genome sequences and serves as a ground truth.
  2. **Benchmarking Variant Callers:** We evaluate five somatic variant callers (GATK-Mutect2, Freebayes, VarDict, VarScan2, and LoFreq) using these synthetic datasets.
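
As a rough illustration of what benchmarking against a ground truth involves, the sketch below compares a caller's variant set with a truth set and derives precision and recall. It is not the framework's own analysis code: the `Variant` tuple layout, the `vaf` and `compare_calls` helpers, and the toy coordinates are all assumptions made for this example.

```python
from typing import NamedTuple, Set, Tuple

# Assumed representation: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Benchmark(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float


def vaf(alt_reads: int, depth: int) -> float:
    """Variant allele frequency: fraction of reads supporting the alt allele."""
    return alt_reads / depth if depth > 0 else 0.0


def compare_calls(truth: Set[Variant], called: Set[Variant]) -> Benchmark:
    """Count exact matches between a truth set and a caller's output."""
    tp = len(truth & called)   # present in both sets
    fp = len(called - truth)   # reported by the caller but not in the truth
    fn = len(truth - called)   # in the truth but missed by the caller
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return Benchmark(tp, fp, fn, precision, recall)


# Hypothetical toy data, not taken from the synth4bench datasets.
truth = {("chr17", 100, "A", "G"), ("chr17", 250, "C", "T"), ("chr17", 400, "G", "A")}
called = {("chr17", 100, "A", "G"), ("chr17", 250, "C", "T"), ("chr17", 999, "T", "C")}

result = compare_calls(truth, called)
print(result)

# A variant supported by 5 of 100 reads falls in the low-frequency (<=10%) range.
print(vaf(5, 100) <= 0.10)
```

Exact tuple matching is the simplest possible comparison; real VCF comparisons usually also need allele normalization, which this sketch leaves out.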

Data Download

All data are openly available on Zenodo. For specific instructions, refer to our User Guide.


Installation

  1. Create the Conda environment:

    conda env create -f environment.yml
    conda activate synth4bench

  2. Install NEAT v3.3:

    Download version v3.3.
    To call the main script:

    python gen_reads.py --help

    For further details, see the NEAT README included in the download.

  3. Install bam-readcount:

    Follow their installation instructions.
    After building, verify installation:

    build/bin/bam-readcount --help

    If you encounter issues during the make process, you can alternatively use the executable available here and place it in the bam-readcount/build/bin folder.

  4. Download VarScan Extra Script:

    The extra script vscan_pileup2cns2vcf.py for VarScan is available here.
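
Once the four steps are done, a quick check that the downloaded and built files are where you expect them can save a failed run. The sketch below is one possible way to do that. The directory layout is an assumption (only the file names and the `bam-readcount/build/bin` folder come from the steps above), and `missing_components` is a hypothetical helper, not part of synth4bench.

```python
from pathlib import Path
from typing import Dict, List


def missing_components(expected: Dict[str, Path]) -> List[str]:
    """Return the names of components whose expected file does not exist."""
    return [name for name, path in expected.items() if not path.is_file()]


# Assumed locations; adjust to wherever each tool was downloaded or built.
expected = {
    "NEAT gen_reads.py": Path("NEAT") / "gen_reads.py",
    "bam-readcount": Path("bam-readcount") / "build" / "bin" / "bam-readcount",
    "VarScan extra script": Path("vscan_pileup2cns2vcf.py"),
}

missing = missing_components(expected)
if missing:
    print("Missing components:", ", ".join(missing))
else:
    print("All expected files found.")
```

This only confirms that the files exist; running each tool with `--help`, as described in the steps, remains the real test that it works.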


Execution

Simply configure your parameters in the parameters.yaml file, then execute:

    bash s4b_run.sh

This single command generates synthetic data, runs variant calling for all selected tools, and performs downstream analysis and plotting.

For full execution instructions, see our User Guide.


Documentation

For further documentation, visit the documentation page.


Contribute

We welcome and greatly appreciate any feedback or contributions!

If you have questions, please open an issue here or email sfragkoul@certh.gr.


Citation

Our work has been submitted to the bioRxiv preprint repository. If you use synth4bench, please cite:

S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. E. Psomopoulos, “Synth4bench: a framework for generating synthetic genomics data for the evaluation of tumor-only somatic variant calling algorithms.” 2024, doi:10.1101/2024.03.07.582313.


Related Publications

  • S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. Psomopoulos, “synth4bench: Benchmarking Somatic Variant Callers – A Tale Unfolding In The Synthetic Genomics Feature Space”, 23rd European Conference On Computational Biology (ECCB24), Sep 2024, Turku, Finland, doi:10.5281/zenodo.14186509
  • S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. Psomopoulos, “Exploring Somatic Variant Callers' Behavior: A Synthetic Genomics Feature Space Approach”, ELIXIR AHM24, Jun 2024, Uppsala, Sweden, doi:10.7490/f1000research.1119793.1
  • S.-C. Fragkouli, N. Pechlivanis, A. Orfanou, A. Anastasiadou, A. Agathangelidis, and F. Psomopoulos, “Synth4bench: a framework for generating synthetic genomics data for the evaluation of somatic variant calling algorithms”, 17th Conference of Hellenic Society for Computational Biology and Bioinformatics (HSCBB), Oct 2023, Thessaloniki, Greece, doi:10.5281/zenodo.8432060
  • S.-C. Fragkouli, N. Pechlivanis, A. Agathangelidis, and F. Psomopoulos, “Synthetic Genomics Data Generation and Evaluation for the Use Case of Benchmarking Somatic Variant Calling Algorithms”, 31st Conference in Intelligent Systems For Molecular Biology and the 22nd European Conference On Computational Biology (ISMB-ECCB23), Jul 2023, Lyon, France, doi:10.7490/f1000research.1119575.1

Owner

  • Name: Biodata Analysis Group
  • Login: BiodataAnalysisGroup
  • Kind: organization
  • Email: fpsom@certh.gr

GitHub Events

Total
  • Issues event: 6
  • Delete event: 2
  • Issue comment event: 1
  • Push event: 78
  • Pull request event: 27
  • Fork event: 1
  • Create event: 1
Last Year
  • Issues event: 6
  • Delete event: 2
  • Issue comment event: 1
  • Push event: 78
  • Pull request event: 27
  • Fork event: 1
  • Create event: 1

Dependencies

environment.yml pypi
  • biopython ==1.80
  • contourpy ==1.0.6
  • cycler ==0.11.0
  • fonttools ==4.38.0
  • kiwisolver ==1.4.4
  • matplotlib ==3.6.2
  • matplotlib-venn ==0.11.7
  • numpy ==1.24.0
  • packaging ==22.0
  • pandas ==1.5.2
  • pillow ==9.3.0
  • pyparsing ==3.0.9
  • pysam ==0.19.1
  • python-dateutil ==2.8.2
  • pytz ==2022.7
  • scipy ==1.9.3
  • six ==1.16.0