https://github.com/brentp/tnsv

add true-negative SVs from a population callset to a truth-set.

https://github.com/brentp/tnsv

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ncbi.nlm.nih.gov, nature.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.3%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

add true-negative SVs from a population callset to a truth-set.

Basic Info
  • Host: GitHub
  • Owner: brentp
  • License: mit
  • Language: Nim
  • Default Branch: main
  • Size: 18.6 KB
Statistics
  • Stars: 15
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 6
Created about 5 years ago · Last pushed about 4 years ago
Metadata Files
Readme License

README.md

This software adds true-negative SVs to a truth-set. The true-negatives should be drawn from a real set.

In short you can use like this: wget https://github.com/brentp/tnsv/releases/download/v0.0.5/tnsv chmod +x ./tnsv ./tnsv sv-truth-set.vcf.gz \ population-sv-calls.vcf.gz \ | bcftools sort -O z -o HG002_SVS.with-gnomad-TN.vcf.gz

And many non-overlapping hom-ref (genotype 0/0) calls will be added to the truth-set (in this case, the output is HG002_SVS.with-gnomad-TN.vcf.gz) These calls will be in realistic locations (compared to random locations).

See below for links to some possible population call-sets.

Problem

This is a basic, known data-science problem, but it can still catch even seasoned analysts and it's especially easy to hit this problem in genomics.

Given a truth-set like the Genome in a Bottle SV set we can evaluate methods for SV detection and filtering.

In examples/svfilter.nim I have built a sophisticated classifier that will evaluate a set of SVs and randomly retain (classify as valid) 96% of variants.

This method is able to achieve: + 96% recall + 82% precision

on the actual HG002 SV truth set.

How!!!?

recall is (true-positives / (true-positives + false-negatives)), since we classify 96% of variants as true, then our recall must be 96%.

precision is (true-positives / (true-positives + false-positives)). So why is precision so high? We have: 10757 negative variants and 48606 positive variants. Since the classifier is randomly choosing support or not, then we can expect the precision is: 48606 / (48606 + 10757) which gives us the 82%.

In short, because there are so few negatives, a random classifier (with a bias toward the positives) will appear to have decent or even great performance.

Mitigation

One way to make it harder to miss this problem is to add many true-negative variants. Instead of doing this randomly, we add true variants from a given population or database set.

Only population variants that are not within dist bases (default 100) of a variant in the truth-set are added. This ensures that the added variants are realistic (not random locations).

For example, to add gnomad-sv calls to the Genome in a bottle truth set, use: ```

tnsv $truthset $populationset > $augmentedtruthset

tnsv HG002SVsTier1v0.6.vcf.gz \ nstd166.GRCh37.variantcall.vcf.gz \ | bcftools sort -O z -o HG002_SVS.with-gnomad-TN.vcf.gz ```

Now we can re-try our random classifer with the following results:

  • 96% Recall (which must be the case)
  • 15% Precision (down from 82% on the original truth-set)

With this, we have a such a low precision that we should note that something is wrong.

True Negative sets:

tnsv will do simple re-mapping of chromosomes to match the truth-set so that e.g. '22' in the population set can become 'chr22' if 'chr22' is present in the truth-set.

Owner

  • Name: Brent Pedersen
  • Login: brentp
  • Kind: user
  • Location: Oregon, USA

Doing genomics

GitHub Events

Total
Last Year

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels