https://github.com/brentp/tnsv

add true-negative SVs from a population callset to a truth-set.


Repository


Basic Info

Host: GitHub
Owner: brentp
License: mit
Language: Nim
Default Branch: main
Size: 18.6 KB

Statistics

Stars: 15
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 6

Created about 5 years ago · Last pushed about 4 years ago

README.md

This software adds true-negative SVs to a truth-set. The true-negatives should be drawn from a real set.

In short, you can use it like this:

```
wget https://github.com/brentp/tnsv/releases/download/v0.0.5/tnsv
chmod +x ./tnsv
./tnsv sv-truth-set.vcf.gz \
    population-sv-calls.vcf.gz \
    | bcftools sort -O z -o HG002_SVS.with-gnomad-TN.vcf.gz
```

Many non-overlapping hom-ref (genotype 0/0) calls will then be added to the truth-set (in this case, the output is HG002_SVS.with-gnomad-TN.vcf.gz). Because they are drawn from a real call-set, these calls will be in realistic locations rather than random ones.

See below for links to some possible population call-sets.

Problem

This is a basic, well-known data-science problem, but it can still catch even seasoned analysts, and it is especially easy to hit in genomics.

Given a truth-set like the Genome in a Bottle SV set, we can evaluate methods for SV detection and filtering.

In examples/svfilter.nim I have built a sophisticated classifier that will evaluate a set of SVs and randomly retain (classify as valid) 96% of variants.

This method is able to achieve:

+ 96% recall
+ 82% precision

on the actual HG002 SV truth set.

How!!!?

Recall is (true-positives / (true-positives + false-negatives)). Since we classify 96% of variants as true regardless of their label, we retain about 96% of the true variants, so our recall must be about 96%.

Precision is (true-positives / (true-positives + false-positives)). So why is precision so high? We have 10757 negative variants and 48606 positive variants. Since the classifier randomly retains variants regardless of their label, it keeps the same fraction (96%) of each class, and that fraction cancels out: the expected precision is 48606 / (48606 + 10757), which gives us the 82%.

In short, because there are so few negatives, a random classifier (with a bias toward the positives) will appear to have decent or even great performance.
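The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch, not code from the repository: the function name `random_classifier_metrics` is made up here, and it computes expected values for a classifier that keeps each variant with a fixed probability regardless of its label.

```python
def random_classifier_metrics(n_pos, n_neg, keep_rate):
    """Expected recall and precision when every variant is kept
    with probability `keep_rate`, independent of its true label."""
    tp = keep_rate * n_pos          # positives kept
    fn = (1 - keep_rate) * n_pos    # positives discarded
    fp = keep_rate * n_neg          # negatives kept (wrongly called valid)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision


# Counts quoted above for the HG002 SV truth set.
recall, precision = random_classifier_metrics(48606, 10757, 0.96)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Note that `keep_rate` cancels out of the precision formula, so precision here depends only on the ratio of positives to negatives in the truth set.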

Mitigation

One way to make it harder to miss this problem is to add many true-negative variants. Instead of doing this randomly, we add true variants from a given population or database set.

Only population variants that are not within dist bases (default 100) of a variant in the truth-set are added, so the added calls do not collide with existing truth-set variants. Because they come from a real call-set, the added variants are in realistic (not random) locations.
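The distance rule can be sketched in Python as follows. This is not tnsv's implementation (which is written in Nim); it assumes variants are simple `(chrom, start, end)` tuples, treats "within dist" as inclusive, and uses a hypothetical helper name `far_from_truth`. How tnsv itself measures distance is not stated here.

```python
from collections import defaultdict


def far_from_truth(population, truth, dist=100):
    """Keep population intervals that are more than `dist` bases
    away from every truth interval on the same chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in truth:
        by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in population:
        # Padding the candidate by `dist` on both sides and testing for
        # overlap is the same as asking "is it within dist bases?".
        near = any(
            start - dist <= t_end and end + dist >= t_start
            for t_start, t_end in by_chrom[chrom]
        )
        if not near:
            kept.append((chrom, start, end))
    return kept


truth = [("1", 1000, 2000)]
population = [("1", 2050, 2100), ("1", 2200, 2300), ("2", 1000, 2000)]
kept = far_from_truth(population, truth)
print(kept)
```

The linear scan per candidate keeps the sketch short; for real call-sets a sorted index or interval tree would be the more usual choice.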

For example, to add gnomad-sv calls to the Genome in a Bottle truth set, use:

```
# tnsv $truthset $populationset > $augmentedtruthset
tnsv HG002_SVs_Tier1_v0.6.vcf.gz \
    nstd166.GRCh37.variant_call.vcf.gz \
    | bcftools sort -O z -o HG002_SVS.with-gnomad-TN.vcf.gz
```

Now we can retry our random classifier with the following results:

+ 96% Recall (which must be the case)
+ 15% Precision (down from 82% on the original truth-set)

With this, we have such a low precision that we should notice that something is wrong.
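As a rough consistency check, the same random-classifier reasoning can be inverted to estimate how many negatives a given precision implies. The helper `negatives_for_precision` is hypothetical, and the result is only a back-of-the-envelope figure because the 15% quoted above is rounded; the actual number of added variants is not stated here.

```python
def negatives_for_precision(n_pos, precision):
    """For a label-blind random classifier, precision = P / (P + N).
    Solve that for N, the total number of negatives."""
    return n_pos * (1 - precision) / precision


# 48606 positives, as in the original truth set.
n_neg = negatives_for_precision(48606, 0.15)
print(f"about {n_neg:,.0f} negatives in total")
```

Under these assumptions the augmented truth set would hold on the order of a few hundred thousand negatives, compared with 10757 before augmentation.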

True Negative sets:

tnsv will do simple re-mapping of chromosomes to match the truth-set so that e.g. '22' in the population set can become 'chr22' if 'chr22' is present in the truth-set.
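The remapping idea might look like the following in Python. This is a guess at the behaviour described above, not the tool's actual code: `remap_chrom` is a hypothetical name, and handling the reverse direction ('chr22' to '22') is an assumption beyond what the text states.

```python
def remap_chrom(chrom, truth_chroms):
    """Return the chromosome name as spelled in the truth-set when a
    simple 'chr' prefix change makes it match; otherwise leave it alone."""
    if chrom in truth_chroms:
        return chrom
    # Try the other naming convention: add or strip the 'chr' prefix.
    alt = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    return alt if alt in truth_chroms else chrom


print(remap_chrom("22", {"chr1", "chr22"}))
```

Names that match under neither convention are returned unchanged, so such variants would simply never be near a truth-set variant.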

Owner

Name: Brent Pedersen
Login: brentp
Kind: user
Location: Oregon, USA

Twitter: brent_p
Repositories: 220
Profile: https://github.com/brentp

Doing genomics

