https://github.com/cancerit/hairpin2

artefact flagging algorithm based on Ellis et al. 2020 (DOI: 10.1038/s41596-020-00437-6) for LCM sequencing

https://github.com/cancerit/hairpin2

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: nature.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

artefact flagging algorithm based on Ellis et al. 2020 (DOI: 10.1038/s41596-020-00437-6) for LCM sequencing

Basic Info
  • Host: GitHub
  • Owner: cancerit
  • License: other
  • Language: Python
  • Default Branch: develop
  • Homepage:
  • Size: 453 KB
Statistics
  • Stars: 2
  • Watchers: 10
  • Forks: 0
  • Open Issues: 2
  • Releases: 6
Created over 1 year ago · Last pushed 10 months ago
Metadata Files
Readme Changelog License

README.md

hairpin2

hairpin2 – read-aware artefactual variant flagging

hairpin2 is designed to flag variants that are likely artefactual via a series of tests performed upon the read data associated with each variant. Initially, it was concieved to flag possible cruciform artefacts for LCM sequence data, but the concept has been extended and can detect a variety of potentially spurious variants (including indels). The tool operates on a VCF file containing one or more samples, and alignment files for all samples to be tested.

Given a VCF, and BAM files for the samples of that VCF, return a VCF with variants flagged with ADF if variants have anomalous distributions indicating that they are likely to be artefactual, ALF if relevant reads have lower median alignment score per base than a specified threshold, and DVF if variants appear to be the result of PCR error.

FLAGS

  • ADF; The ADF flag is an implementation of the artifact detection algorithm described in Ellis et al, 2020. It detects variants which appear with anomalously regular positional distribution in supporting reads.

  • ALF; The ALF flag indicates variants which are supported by reads with poor signal-to-noise, per the alignment score. It is complementary to the ADF flag – artefacts with anomalous distributions often cause a marked decrease in alignment score.

  • DVF; The DVF flag is a naive but effective algorithm for detecting variants which are the result of PCR error - in regions of low complexity, short repeats and homopolymer tracts can cause PCR stuttering. PCR stuttering can lead to, for example, an erroneous additional A on the read when amplifying a tract of As. If duplicated reads contain stutter, this can lead to variation of read length and alignment to reference between reads that are in fact duplicates. Because of this, these duplicates both evade dupmarking and give rise to spurious variants when calling. The DVF flag attempts to catch these variants by examining the regularity of the start and end coordinates of collections of supporting reads and their mates.

All flags are tunable such that their parameters can be configured to a variety of use cases and sequencing methods.

Note that the term "filter" is sometimes used in place of the word flag in the helptext.

DEPENDENCIES

  • Python >= 3.12

further dependencies (pysam, pydantic, and optionally pytest and pytest-cov) are detailed in pyproject.toml, and will be downloaded automatically if following the recommend install process

INSTALLATION

The easiest end-user approach is to install into a virtual environment: python -m venv .env source .env/bin/activate pip install . hairpin -h

for development, substitute: pip install -e ".[dev]" to install testing dependencies

ASSUMPTIONS & LIMITATIONS

hairpin2 is designed for paired data where alignment records have the MC tag and the complete CIGAR string is present in the CIGAR field (rather than the CG:B,I tag). If the MC tag is not present in your data, it can be added using samtools fixmate or biobambam2 bamsormadup. The tool can handle substitions, insertions, and deletions formatted per the VCF specification. At this time, the tool will not investigate mutations notated with angle brackets, e.g. <DEL>, complex mutations, or monomorphic reference. No further assumptions are made – other alignment tags and VCF fields are used, however they are mandatory per the relevant format specifications. If these requirements are limiting and you need the tool to be extended in some way, please request it.

USAGE

The recommended usage is to provide a config of flag parameters along with the VCF in question and the relavant alignments (.sam/.bam/.cram), like so: hairpin2 -c myconfig.json variants.vcf aln.cram > output.vcf A config of default parameters is provided in example-configs/. All config parameters are equivalently named to their command line overrides (see helptext below), except - is replaced by _.

full helptext: ``` Usage: hairpin2 [-h, --help] [OPTIONS] VCF ALIGNMENTS...

read-aware artefactual variant flagging algorithms. Flag variants in VCF using statistics calculated from supporting reads found in ALIGNMENTS, and emit the flagged VCF to stdout.

Options: -v, --version Show the version and exit. -h, --help show help (-h for basic, -hh for extended including config override options) -c, --config FILEPATH path to config JSON from which filter paramters will be loaded - can be overridden by extended arguments provided at runtime -o, --output-config FILEPATH log filter paramaters from run as a config JSON file -m, --name-mapping STR | FILEPATH If sample names in VCF differ from SM tags in alignment files, provide a key here to map them. Accepts a path to a JSON file, or JSON-formatted string of key-value pairs where keys are sample names in the VCF and all values are either the SM tag or the filepath of the relevant alignment - e.g. '{"sample0": "PDxxA", "sample1": "PDxxB"}' or '{"sample0": "A.bam", ...}'. When only a single alignment is provided, also accepts a JSON-spec top-level array of possible sample of interest names - e.g. '["TUMOR","TUMOUR"]'. Note that when providing a JSON-formatted string at the command line you must single quote the string, and use only double quotes internally -r, --cram-reference FILEPATH path to FASTA format CRAM reference, overrides $REF_PATH and UR tags for CRAM alignments -q, --quiet be quiet (-q to not log INFO level messages, -qq to additionally not log WARN) -p, --progress display progress bar on stderr during run

read validation config overrides: --min-clip-quality INT discard reads with mean base quality of aligned bases below this value, if they have soft-clipped bases [default: (35); 0<=x<=60] --min-mapping-quality INT discard reads with mapping quality below this value [default: (11); 0<=x<=93] --min-base-quality INT discard reads with base quality below this value at variant position [default: (25); x>=-1]

DVF config overrides: --duplication-window-size INT inclusive maximum window size in number of bases within which read pairs supporting a variant may be considered possible duplicates of eachother. -1 will disable duplicate detection [default: (6); x>=0] --loss-ratio FLOAT ratio of the number of reads found to be duplicates against the total number of supporting reads, above which a variant is flagged DVF. In logical AND with a hardcoded test that at least 2 supporting reads are independent, i.e. not duplicates of each other, to ensure that regardless of the value of loss_ratio collapse of duplicates to only a single supporting read always results in a DVF flag. Smaller is more sensitive. Set to 0.99 to rely only on the hardcoded test (practically speaking). [default: (0.49); 0.0<=x<=0.99]

ALF config overrides: --al-filter-threshold FLOAT threshold for median of read alignment score per base of all relevant reads, at and below which a variant is flagged ALF [default: (0.93); 0<=x<=93]

ADF config overrides: --edge-definition FLOAT percentage of a read that is considered to be "the edge" for the purposes of assessing variant position distribution [default: (0.15); 0.0<=x<=0.99] --edge-fraction FLOAT percentage of variants must occur within EDGEFRACTION of read edges to mark ADF flag [default: (0.9); 0.0<=x<=0.99] --min-mad-one-strand INT Mean Average Devaition of distances between variant position and read start above which a variant cannot be considered anomalous - used when only one strand has sufficient valid reads for testing [default: (0); x>=0] --min-sd-one-strand FLOAT stdev of distances between variant position and read start above which a variant cannot be considered anomalous - used when only one strand has sufficient valid reads for testing [default: (4.0); x>=0.0] --min-mad-both-strand-weak INT Mean Average Devaition of distances between variant position and read start above which a variant cannot be considered anomalous - used when both strands have sufficient valid reads for testing, in logical AND with `minsdbothstrandweak, and logical OR with corresponding strong condtion pair [default: (2); x>=0] --min-sd-both-strand-weak FLOAT stdev of distances between variant position and read start above which a variant cannot be considered anomalous - used when both strands have sufficient valid reads for testing, in logical AND with minmadbothstrandweak, and logical OR with corresponding strong condtion pair [default: (2.0); x>=0.0] --min-mad-both-strand-strong INT Mean Average Devaition of distances between variant position and read start above which a variant cannot be considered anomalous - used when both strands have sufficient valid reads for testing, in logical AND with minsdbothstrandstrong, and logical OR with corresponding weak condtion pair [default: (1); x>=0] --min-sd-both-strand-strong FLOAT stdev of distances between variant position and read start above which a variant cannot be considered anomalous - used when both strands have sufficient valid reads for testing, in logical AND with minmadbothstrand_weak, and logical OR with the corresponding weak condtion pair [default: (10.0); x>=0.0] --min-reads INT number of reads at and below which the hairpin filtering logic considers a strand to have insufficient reads for testing for a given variant [default: (1); x>=1] ``

Parameters are hopefully mostly clear from the helptext, but some warrant additional commentary:

  • --name-mapping – When using multisample VCFs, hairpin2 compares VCF sample names found in the VCF header to SM tags in alignments to match samples of interest to the correct alignment. If these IDs are different between the VCF and alignments, you'll need to provide a JSON key. If there are multiple samples of interest in a multisample VCF, and therefore it is necessary to provide multiple alignments, you will need to provide a mapping for each pair - e.g. -m '{"sample1":"SM1", "sample2":"SM2", ...}' or -m '{"sample1:"1.bam", ...}'. If there is only one sample of interest, and therefore only one alignment is provided to the tool, then you also have an optional shorthand - you need only indicate which VCF sample is the sample of interest, e.g. -m '["TUMOR"]'. When there is only one sample of interest, and therefore one alignment, but the sample of interest may have one of several possible names, you may also provide a comma separated string of possible names for the sample of interest, e.g. -m '["TUMOR", "TUMOUR"]' - users have found this valuable for high throughput workflows where the VCF input may be coming from one of several prior processes (which may name samples differently). In all cases, there must be one and only one match between alignment and VCF sample. In all cases, a path to a JSON file may be provided instead of the JSON string. Note that a VCF containing both a TUMOUR and a NORMAL sample contains 2 samples, and therefore is a multisample VCF.
  • --al-filter-threshold – In the predecessor to hairpin2, additionalBAMStatistics, this value was known as ASRD and the default was set at 0.87.
  • For all parameters, defaults were found by trial and error on LCM data and you may find it necessary to experiment with this parameter depending on data type.

Reading the test() method implementation of each flag may be informative. These can be found in hairpin2/filters/

The tool tests records in a VCF file and applies flags as appropriate. Reasoning for decisions is recorded in the INFO field of the VCF records, in the form <FLAG>=<alt>|<True/False>|<code>|.... Additional data (noted by the ellipsis) where present are described in the relevant header line of the VCF. The codes indicate the reason on which the decision was made, and are as follows:

DVF:

0 – insufficient supporting reads
1 – variant is/not the result of PCR stuttering

ALF:

0 – insufficient supporting reads
1 – variant passed/failed on flag threshold

ADF:

0 – insufficient supporting reads
1 – passed/failed on condition 60A(i) of Ellis et al. (ADF only)
2 – passed/failed on condition 60B(i) of Ellis et al. (ADF only)

The basic procedure of this implementation is as follows:

For each record in the VCF, test every alt for that record as follows:
1. for samples exhibiting the mutation, retrieve reads covering the region 2. test each read for validity for use in testing (i.e. base quality, do they express the correct alt, and so on) 3. performing statistical analysis on read context 4. pass or fail the record for each flag and log to INFO

TESTING

A test suite has been provided with the algorithm implementation. To run these tests run pytest . from within the install directory. hairpin2 must have been installed from that same directory, and be available on path (for example in a virtual environment). The tests can be found in the test directory. The focus is on testing all nodes (statements) and edges (paths between statements) for key scientific logic, rather than trying every possible input combination, in addition to testing critical boundary conditions. Since all input possibilities will pass via the same network/graph, if we prove that each part of that network functions correctly, we can be more confident that the program functions as it claims to. The tests are simple, and, once you have read them, it should be very easy to add your own further tests should you feel the need to confirm any behaviour.

TODO

  • automated regression testing
  • stricter config and params validation, most likely with pydantic, to catch misformatted configs earlier
  • further boundary condition testing
  • improve documentation - describe flags in individual sections, beyond replicating helptext
  • switch entirely to fstrings from .format()
  • disscussions to be had on multisample VCF support

LICENCE

``` hairpin2

Copyright (C) 2024, 2025 Genome Research Ltd.

Author: Alex Byrne ab63@sanger.ac.uk

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ```

Owner

  • Name: CASM IT
  • Login: cancerit
  • Kind: organization
  • Email: cgpit@sanger.ac.uk
  • Location: Hinxton, Cambridge, UK

CASM IT provide bioinformatic support for Cancer, Ageing and Somatic Mutation group at the Wellcome Sanger Institute

GitHub Events

Total
  • Create event: 7
  • Release event: 4
  • Issues event: 6
  • Watch event: 2
  • Delete event: 4
  • Issue comment event: 8
  • Push event: 58
  • Public event: 1
  • Pull request review comment event: 3
  • Pull request review event: 3
  • Pull request event: 14
Last Year
  • Create event: 7
  • Release event: 4
  • Issues event: 6
  • Watch event: 2
  • Delete event: 4
  • Issue comment event: 8
  • Push event: 58
  • Public event: 1
  • Pull request review comment event: 3
  • Pull request review event: 3
  • Pull request event: 14

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 3
  • Total pull requests: 8
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 1.67
  • Average comments per pull request: 0.25
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 8
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 1 day
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 1.67
  • Average comments per pull request: 0.25
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • blex-max (1)
  • Phuong-Le (1)
  • Ajburdett (1)
Pull Request Authors
  • blex-max (8)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

Dockerfile docker
  • python 3.12-slim build
pyproject.toml pypi
  • pytest ^8.2.2 develop
  • pysam ^0.22
  • python ^3.10