deep-pileup-wrapper

REMIND-Cancer QC for Checking Individual Genomic Positions

https://github.com/nicholas-abad/deep-pileup-wrapper

Repository

Basic Info

Host: GitHub
Owner: nicholas-abad
Language: Python
Default Branch: main
Homepage: https://www.biorxiv.org/content/10.1101/2024.06.03.597231v1
Size: 5.01 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

Deep Pileup

This repository contains a pipeline that performs deep pileup analyses on variant positions across tumor and control BAM files and generates summary statistics and visualizations per gene and genomic position. The results help identify problematic positions, such as noisy sites and potential artifacts.

Example

An example of a technically valid genomic position is shown in (A), whereas a noisy position and potential artifact can be seen in (B):

Overview

The pipeline is designed to work in high-performance computing environments (e.g., DKFZ LSF cluster) and supports automated pileup extraction using samtools, downstream aggregation, variant allele frequency (VAF) analysis, and visualization.

Key Features

Automatic pileup calling using samtools on user-provided BAM files.
Flexible metadata input to define control and tumor BAM file paths for each cohort.
Custom filtering thresholds:
- minFreq: minimum frequency to consider a variant (default: 0.9)
- minAbs: absolute read count threshold (default: 2)
- minHetSnp: minimum second allele frequency for heterozygosity (default: 0.25)
Visual summaries including:
- Allele frequency > 25% across cohorts
- At least 2 supporting reads per base (A/C/G/T)
Support for LSF cluster job submission via bsub
Modular structure for reusability of key components
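
One possible reading of the two visual-summary criteria is sketched below in Python. The function name `summarize_counts` and the input dictionary are hypothetical, and the pipeline's own implementation may differ; only the default values for minAbs (2) and minHetSnp (0.25) come from the list above. The exact semantics of minFreq are not described here, so it is left out.

```python
# Illustrative sketch only: names and exact semantics are assumptions,
# not taken from the repository's code.
MIN_ABS = 2          # minAbs default listed above
MIN_HET_SNP = 0.25   # minHetSnp default listed above


def summarize_counts(counts, min_abs=MIN_ABS, min_het_snp=MIN_HET_SNP):
    """Flag minor alleles that pass the frequency and read-count criteria.

    counts maps each base (A/C/G/T) to its number of supporting reads.
    """
    total = sum(counts.values())
    if total == 0:
        return {"major": None, "af_above_threshold": [], "at_least_min_abs": []}

    # Assumption: the most frequent base is treated as the major allele.
    major = max(counts, key=counts.get)
    minor = [base for base in counts if base != major]

    return {
        "major": major,
        # Minor alleles whose frequency exceeds min_het_snp (e.g. > 25%).
        "af_above_threshold": [b for b in minor if counts[b] / total > min_het_snp],
        # Minor alleles supported by at least min_abs reads.
        "at_least_min_abs": [b for b in minor if counts[b] >= min_abs],
    }


example = summarize_counts({"A": 60, "C": 0, "G": 38, "T": 2})
print(example)
```

In this made-up example, G passes both criteria, while T passes only the read-count criterion.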

Usage

Step 1: Prepare Input Files

A metadata CSV file with the following columns:
- pid, cohort, path_to_control_bam, path_to_tumor_bam
A VCF-like file with columns:
- GENE, #CHROM, POS
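
Before launching the pipeline, it may help to confirm that both input files carry the expected header columns. The following is a minimal sketch using only the Python standard library. The helper `missing_columns`, the sample rows, and the tab delimiter for the VCF-like file are assumptions for illustration; only the column names come from the list above.

```python
import csv
import io

# Column names as listed above.
METADATA_COLUMNS = ["pid", "cohort", "path_to_control_bam", "path_to_tumor_bam"]
VARIANT_COLUMNS = ["GENE", "#CHROM", "POS"]


def missing_columns(text, required, delimiter):
    """Return the required column names absent from the header row of text."""
    header = next(csv.reader(io.StringIO(text), delimiter=delimiter), [])
    return [name for name in required if name not in header]


# Placeholder rows: the ID, cohort name, paths, gene, and position are invented.
metadata_text = (
    "pid,cohort,path_to_control_bam,path_to_tumor_bam\n"
    "patient_1,cohort_a,/path/to/control.bam,/path/to/tumor.bam\n"
)
# Tab-delimited here; the delimiter the pipeline expects is an assumption.
variants_text = "GENE\t#CHROM\tPOS\nGENE_A\t1\t12345\n"

print(missing_columns(metadata_text, METADATA_COLUMNS, ","))
print(missing_columns(variants_text, VARIANT_COLUMNS, "\t"))
```

An empty list means every required column is present in the header.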

Step 2: Run the Pipeline

Use main.py to iterate through a VCF file and call deep_pileup_single_position.py on each site.

    python main.py \
        --path-to-vcf path/to/input.vcf \
        --path-to-metadata path/to/metadata.csv \
        --output-path-to-repository ./deep_pileup_output \
        --path-to-dp-single-position-script ./_deep_pileup_single_position.py \
        --dkfz-cluster True

Optional arguments:

--starting-index and --ending-index: subset range for variants
--dkfz-cluster: whether to submit jobs via LSF (True or False)
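
When the pipeline is driven from another Python script, the same command can be assembled as an argument list. This is a hedged sketch: the helper `build_main_command` is hypothetical, the index flags are assumed to take integers, and whether the ending index is inclusive is not stated here.

```python
import sys


def build_main_command(path_to_vcf, path_to_metadata, output_dir,
                       single_position_script, dkfz_cluster=False,
                       starting_index=None, ending_index=None):
    """Assemble the argument list for main.py (suitable for subprocess.run)."""
    command = [
        sys.executable, "main.py",
        "--path-to-vcf", path_to_vcf,
        "--path-to-metadata", path_to_metadata,
        "--output-path-to-repository", output_dir,
        "--path-to-dp-single-position-script", single_position_script,
        "--dkfz-cluster", str(dkfz_cluster),
    ]
    # Assumption: both index flags accept integer values.
    if starting_index is not None:
        command += ["--starting-index", str(starting_index)]
    if ending_index is not None:
        command += ["--ending-index", str(ending_index)]
    return command


command = build_main_command(
    "path/to/input.vcf", "path/to/metadata.csv", "./deep_pileup_output",
    "./_deep_pileup_single_position.py", dkfz_cluster=False,
    starting_index=0, ending_index=100,
)
print(" ".join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would start a local (non-LSF) run over the chosen subset of variants.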

Step 3: Output

For each position, the following is generated:

pileup_control_<cohort>.txt, pileup_tumor_<cohort>.txt
Overview.tsv: summary statistics for all cohorts
Visualizations:
- af_greater_than_25_only_relevant.png
- af_greater_than_25_all.png
- at_least_two_variant_alleles_only_relevant.png
- at_least_two_variant_alleles_all.png
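
To check that a run has finished for a given position, the expected file names can be generated from the cohort list. The sketch below assumes all files for one position sit directly in a single directory, which is not confirmed here; the helpers `expected_outputs` and `missing_outputs` are hypothetical.

```python
from pathlib import Path

# Plot file names as listed above.
PLOT_FILES = [
    "af_greater_than_25_only_relevant.png",
    "af_greater_than_25_all.png",
    "at_least_two_variant_alleles_only_relevant.png",
    "at_least_two_variant_alleles_all.png",
]


def expected_outputs(cohorts):
    """List the file names expected for one position, given its cohorts."""
    files = []
    for cohort in cohorts:
        files.append(f"pileup_control_{cohort}.txt")
        files.append(f"pileup_tumor_{cohort}.txt")
    files.append("Overview.tsv")
    files.extend(PLOT_FILES)
    return files


def missing_outputs(position_dir, cohorts):
    """Return the expected files that do not exist in position_dir."""
    directory = Path(position_dir)
    return [name for name in expected_outputs(cohorts)
            if not (directory / name).exists()]


print(expected_outputs(["cohort_a"]))
```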

How It Works

For each genomic position provided:

The script loads the metadata file and filters out any entries that do not have associated tumor or control BAM files.
For each valid cohort:
- It calls samtools view to extract reads from the specified genomic position in each BAM file.
- The resulting reads are piped into samtools mpileup to generate base-level information (pileup).
- Only the lines corresponding to the target position are retained via grep.
These pileup outputs are saved into pileup_control_<cohort>.txt and pileup_tumor_<cohort>.txt files.
A summary file (Overview.tsv) is generated from all pileup files to report:
- Total reads and base distribution (A/C/G/T)
- SNPs whose allele frequency exceeds the specified thresholds
- Positions where a base is supported by ≥2 reads
Two sets of plots are generated to visualize:
- Minor allele frequencies greater than 25%
- Variant bases observed at least twice
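
The extraction and counting steps above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's code: the helpers `pileup_command` and `count_bases` are hypothetical, the BAM file is assumed to be indexed so that a region can be queried, and `grep -P` assumes GNU grep. The base parser follows the standard pileup read-bases notation (`.`/`,` for reference matches, `^` plus a mapping-quality character at read starts, `$` at read ends, `+N`/`-N` for indels).

```python
import shlex


def pileup_command(bam_path, chrom, pos):
    """Build a shell pipeline: samtools view | samtools mpileup | grep."""
    region = f"{chrom}:{pos}-{pos}"
    # Keep only pileup lines that start with "<chrom>\t<pos>\t".
    pattern = "^" + str(chrom) + r"\t" + str(pos) + r"\t"
    return (
        f"samtools view -b {shlex.quote(bam_path)} {shlex.quote(region)}"
        " | samtools mpileup -"
        f" | grep -P {shlex.quote(pattern)}"
    )


def count_bases(pileup_bases, ref_base="N"):
    """Count A/C/G/T calls in the read-bases column of one pileup line."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    i = 0
    while i < len(pileup_bases):
        ch = pileup_bases[i]
        if ch == "^":
            # Read-start marker; the next character is a mapping quality.
            i += 2
            continue
        if ch in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            j = i + 1
            while j < len(pileup_bases) and pileup_bases[j].isdigit():
                j += 1
            length = int(pileup_bases[i + 1:j]) if j > i + 1 else 0
            i = j + length
            continue
        if ch in ".,":
            # Match to the reference base on the forward/reverse strand.
            ch = ref_base
        base = ch.upper()
        if base in counts:
            counts[base] += 1
        i += 1
    return counts


print(pileup_command("/data/tumor.bam", "1", 12345))
print(count_bases("..,,A^Fa$T+2AGc-1Ng*", ref_base="G"))
```

Characters such as `$` and `*` are skipped because they do not represent an A/C/G/T call at the position itself.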

Dependencies

Python 3.11.12
Required Python packages are listed in requirements.txt (install with pip install -r requirements.txt)
Samtools 1.14 (https://github.com/samtools/samtools/releases/tag/1.14)
- Though later versions exist, this was originally developed using Samtools 1.14
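
A quick way to see which versions are available in the current environment is sketched below. The helper names are hypothetical; the parser relies on the first line of `samtools --version` having the form `samtools <version>`.

```python
import shutil
import subprocess
import sys


def parse_samtools_version(version_output):
    """Extract the version from the first line of `samtools --version`."""
    lines = version_output.splitlines()
    parts = lines[0].split() if lines else []
    if len(parts) >= 2 and parts[0] == "samtools":
        return parts[1]
    return None


def installed_samtools_version():
    """Return the installed samtools version, or None if it is not on PATH."""
    if shutil.which("samtools") is None:
        return None
    result = subprocess.run(
        ["samtools", "--version"], capture_output=True, text=True
    )
    return parse_samtools_version(result.stdout)


print("Python:", ".".join(map(str, sys.version_info[:3])))
print("samtools:", installed_samtools_version())
```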

Contact

For questions or support, contact: nicholas.a.abad@gmail.com

Owner

Login: nicholas-abad
Kind: user
Location: Heidelberg, Germany

Website: https://www.linkedin.com/in/nicholasabad/
Repositories: 7
Profile: https://github.com/nicholas-abad

Machine Learning / Bioinformatics PhD Student at the DKFZ (German Cancer Research Center)

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you'd like to cite DeepPileup, please use the following citation. However, if you'd like to cite the results, please cite the paper."
authors:
  - family-names: Abad
    given-names: Nicholas
    orcid: https://orcid.org/0009-0004-8322-564X
title: "DeepPileup"
version: 1.0
identifiers:
  - type: doi
    value: https://www.biorxiv.org/content/10.1101/2024.06.03.597231v1
date-released: 2024-04-24
