deep-pileup-wrapper
REMIND-Cancer QC for Checking Individual Genomic Positions
Repository
Basic Info
- Host: GitHub
- Owner: nicholas-abad
- Language: Python
- Default Branch: main
- Homepage: https://www.biorxiv.org/content/10.1101/2024.06.03.597231v1
- Size: 5.01 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Deep Pileup
This repository contains a pipeline that performs deep pileup analyses on variant positions across tumor and control BAM files and generates summary statistics and visualizations per gene and genomic position. The output helps identify problematic positions, such as noisy sites that may be technical artifacts.
Example
An example of a technically valid genomic position can be seen in (A), whereas a noisy position and potential artifact can be seen in (B):

Overview
The pipeline is designed to work in high-performance computing environments (e.g., DKFZ LSF cluster) and supports automated pileup extraction using samtools, downstream aggregation, variant allele frequency (VAF) analysis, and visualization.
Key Features
- Automatic pileup calling using samtools on user-provided BAM files.
- Flexible metadata input to define control and tumor BAM file paths for each cohort.
- Custom filtering thresholds:
  - minFreq: minimum frequency to consider a variant (default: 0.9)
  - minAbs: absolute read count threshold (default: 2)
  - minHetSnp: minimum second allele frequency for heterozygosity (default: 0.25)
- Visual summaries including:
  - Allele frequency > 25% across cohorts
  - At least 2 supporting reads per base (A/C/G/T)
- Support for LSF cluster job submission via bsub
- Modular structure for reusability of key components
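As a rough illustration of how the thresholds and summary criteria listed above could be applied to per-base read counts, the sketch below flags non-reference bases with at least minAbs supporting reads and those whose allele frequency exceeds minHetSnp. The class, the function name, the input dictionary, and the way each threshold is used are assumptions made for illustration; the pipeline's own logic may differ, and minFreq is carried along without being interpreted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Default values come from the README; the field names are illustrative.
    min_freq: float = 0.9      # documented as minFreq (not interpreted in this sketch)
    min_abs: int = 2           # documented as minAbs
    min_het_snp: float = 0.25  # documented as minHetSnp


def summarize_counts(counts, ref_base, thresholds=Thresholds()):
    """Hypothetical helper: flag non-reference bases by read count and frequency."""
    total = sum(counts.values())
    result = {"total": total, "supported": [], "above_af": []}
    for base, n in sorted(counts.items()):
        if base == ref_base or total == 0:
            continue
        # Base is seen in at least min_abs reads.
        if n >= thresholds.min_abs:
            result["supported"].append(base)
        # Base frequency exceeds the heterozygosity threshold.
        if n / total > thresholds.min_het_snp:
            result["above_af"].append(base)
    return result


# Made-up counts for a single position with reference base C.
example = summarize_counts({"A": 1, "C": 30, "G": 0, "T": 12}, ref_base="C")
print(example)
```

In this made-up example only T passes both checks, since A is seen once and G not at all.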
Usage
Step 1: Prepare Input Files
- A metadata CSV file with the following columns:
pid,cohort,path_to_control_bam,path_to_tumor_bam
- A VCF-like file with columns:
GENE,#CHROM,POS
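Before launching the pipeline, it can be useful to check that the metadata file has the expected columns. The following is a minimal sketch using only the standard library; the function name and the example rows are hypothetical, and it assumes that a row is usable only when both BAM paths are filled in.

```python
import csv
import io

REQUIRED_COLUMNS = ["pid", "cohort", "path_to_control_bam", "path_to_tumor_bam"]


def load_metadata(handle):
    """Read metadata rows, keeping only those with both BAM paths present."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"metadata is missing columns: {missing}")
    rows = []
    for row in reader:
        control = (row["path_to_control_bam"] or "").strip()
        tumor = (row["path_to_tumor_bam"] or "").strip()
        if control and tumor:
            rows.append(row)
    return rows


# Made-up example data; the IDs and paths are placeholders.
sample = io.StringIO(
    "pid,cohort,path_to_control_bam,path_to_tumor_bam\n"
    "P001,cohortA,/data/P001_control.bam,/data/P001_tumor.bam\n"
    "P002,cohortA,,/data/P002_tumor.bam\n"
)
kept = load_metadata(sample)
print([row["pid"] for row in kept])
```

With a real file, the same function could be called as `load_metadata(open("path/to/metadata.csv", newline=""))`.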
Step 2: Run the Pipeline
Use main.py to iterate through a VCF file and call deep_pileup_single_position.py on each site.
```bash
python main.py \
  --path-to-vcf path/to/input.vcf \
  --path-to-metadata path/to/metadata.csv \
  --output-path-to-repository ./deep_pileup_output \
  --path-to-dp-single-position-script ./_deep_pileup_single_position.py \
  --dkfz-cluster True
```
Optional arguments:
- --starting-index and --ending-index: subset range for variants
- --dkfz-cluster: whether to submit jobs via LSF (True or False)
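To make the command-line interface more concrete, here is a sketch of how these flags could be declared with argparse. It is not the actual main.py: which flags are required, the default values, the True/False parsing, and the exclusive end index are all assumptions.

```python
import argparse


def str_to_bool(value):
    """Interpret 'True'/'False' strings as booleans (assumed convention)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Deep pileup wrapper (illustrative sketch)")
    parser.add_argument("--path-to-vcf", required=True)
    parser.add_argument("--path-to-metadata", required=True)
    parser.add_argument("--output-path-to-repository", required=True)
    parser.add_argument("--path-to-dp-single-position-script", required=True)
    parser.add_argument("--dkfz-cluster", type=str_to_bool, default=False)
    parser.add_argument("--starting-index", type=int, default=None)
    parser.add_argument("--ending-index", type=int, default=None)
    return parser


def select_variants(variants, start, end):
    """Slice the variant list; the end index is treated as exclusive (assumption)."""
    return variants[start:end]


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args([
    "--path-to-vcf", "path/to/input.vcf",
    "--path-to-metadata", "path/to/metadata.csv",
    "--output-path-to-repository", "./deep_pileup_output",
    "--path-to-dp-single-position-script", "./_deep_pileup_single_position.py",
    "--dkfz-cluster", "True",
    "--starting-index", "1",
    "--ending-index", "3",
])
subset = select_variants(["v0", "v1", "v2", "v3"], args.starting_index, args.ending_index)
print(args.dkfz_cluster, subset)
```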
Step 3: Output
For each position, the following is generated:
- pileup_control_<cohort>.txt, pileup_tumor_<cohort>.txt
- Overview.tsv: summary statistics for all cohorts
- Visualizations:
  - af_greater_than_25_only_relevant.png
  - af_greater_than_25_all.png
  - at_least_two_variant_alleles_only_relevant.png
  - at_least_two_variant_alleles_all.png
How It Works
For each genomic position provided:
- The script loads the metadata file and filters out any entries that do not have associated tumor or control BAM files.
- For each valid cohort:
  - It calls samtools view to extract reads from the specified genomic position in each BAM file.
  - The resulting reads are piped into samtools mpileup to generate base-level information (pileup).
  - Only the lines corresponding to the target position are retained via grep.
- These pileup outputs are saved into pileup_control_<cohort>.txt and pileup_tumor_<cohort>.txt files.
- A summary file (Overview.tsv) is generated from all pileup files to report:
  - Total reads and base distribution (A/C/G/T)
  - SNPs with allele frequency > specified thresholds
  - Positions where a base is supported by ≥2 reads
- Two sets of plots are generated to visualize:
  - Minor allele frequencies greater than 25%
  - Variant bases observed at least twice
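The steps above can be sketched in two parts: composing the view, mpileup, and grep pipeline for one position, and counting A/C/G/T in one line of pileup output. This is an illustration, not the repository's code. The function names, the specific samtools and grep options (`-u`, reading from `-`, `grep -P`), and the example pileup line are assumptions; only the general pileup text format is relied on.

```python
import re
import shlex


def build_pileup_command(bam_path, chrom, pos):
    """Compose a view | mpileup | grep shell pipeline for one position (illustrative)."""
    region = f"{chrom}:{pos}-{pos}"
    return (
        f"samtools view -u {shlex.quote(bam_path)} {shlex.quote(region)}"
        f" | samtools mpileup -"
        f" | grep -P '\\t{pos}\\t'"
    )


def count_bases(ref, bases):
    """Count A/C/G/T in a pileup base string, skipping markers and indels."""
    counts = dict.fromkeys("ACGT", 0)
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Read-start marker is followed by one mapping-quality character.
            i += 2
            continue
        if ch in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            digits = re.match(r"\d+", bases[i + 1:])
            i += 1 + (len(digits.group()) + int(digits.group()) if digits else 0)
            continue
        # '.' and ',' mean "matches the reference" on the forward/reverse strand.
        base = ref.upper() if ch in ".," else ch.upper()
        if base in counts:
            counts[base] += 1
        i += 1
    return counts


def parse_pileup_line(line):
    """Split one pileup line into its position, depth, and base counts."""
    chrom, pos, ref, depth, bases = line.rstrip("\n").split("\t")[:5]
    return {"chrom": chrom, "pos": int(pos), "depth": int(depth),
            "counts": count_bases(ref, bases)}


# Placeholder BAM path and a made-up pileup line for demonstration.
cmd = build_pileup_command("/data/P001_tumor.bam", "chr17", 7675088)
record = parse_pileup_line("chr17\t7675088\tC\t6\t..,T^It$.-1A\tIIIIII")
print(cmd)
print(record["counts"])
```

The base counts returned here are the kind of per-position summary that a file like Overview.tsv could aggregate across cohorts.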
Dependencies
- Python 3.11.12
  - Required Python packages are listed in requirements.txt
- Samtools 1.14 (https://github.com/samtools/samtools/releases/tag/1.14)
  - Later versions exist, but the pipeline was originally developed using Samtools 1.14
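A quick environment check can catch version mismatches before any jobs are submitted. The sketch below parses the first line of `samtools --version` output and compares the running Python version against 3.11. Both helper names are hypothetical, and the version text is hard-coded here so the example runs without samtools installed.

```python
import re
import sys


def parse_samtools_version(version_output):
    """Extract (major, minor) from the first line of `samtools --version` output."""
    first_line = version_output.splitlines()[0] if version_output else ""
    match = re.match(r"samtools (\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError(f"unrecognized version line: {first_line!r}")
    return int(match.group(1)), int(match.group(2))


def python_matches(major=3, minor=11):
    """Return True when the running interpreter is the expected major.minor."""
    return sys.version_info[:2] == (major, minor)


# Hard-coded sample text standing in for real command output.
version = parse_samtools_version("samtools 1.14\nUsing htslib 1.14\n")
print(version, python_matches())
```

In practice the text would come from something like `subprocess.run(["samtools", "--version"], capture_output=True, text=True).stdout`.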
Contact
For questions or support, contact: nicholas.a.abad@gmail.com
Owner
- Login: nicholas-abad
- Kind: user
- Location: Heidelberg, Germany
- Website: https://www.linkedin.com/in/nicholasabad/
- Repositories: 7
- Profile: https://github.com/nicholas-abad
Machine Learning / Bioinformatics PhD Student at the DKFZ (German Cancer Research Institute)
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you'd like to cite DeepPileup, please use the following citation. However, if you'd like to cite the results, please cite the paper."
authors:
  - family-names: Abad
    given-names: Nicholas
    orcid: https://orcid.org/0009-0004-8322-564X
title: "DeepPileup"
version: 1.0
identifiers:
  - type: doi
    value: https://www.biorxiv.org/content/10.1101/2024.06.03.597231v1
date-released: 2024-04-24
GitHub Events
Total
- Watch event: 2
- Push event: 7
Last Year
- Watch event: 2
- Push event: 7