https://github.com/brentp/tnsv
add true-negative SVs from a population callset to a truth-set.
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: ncbi.nlm.nih.gov, nature.com)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.3%, to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: brentp
- License: mit
- Language: Nim
- Default Branch: main
- Size: 18.6 KB
Statistics
- Stars: 15
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 6
Metadata Files
README.md
This software adds true-negative SVs to a truth-set. The true-negatives should be drawn from a real set.
In short, you can use it like this:
wget https://github.com/brentp/tnsv/releases/download/v0.0.5/tnsv
chmod +x ./tnsv
./tnsv sv-truth-set.vcf.gz \
population-sv-calls.vcf.gz \
| bcftools sort -O z -o HG002_SVS.with-gnomad-TN.vcf.gz
Many non-overlapping hom-ref (genotype 0/0) calls will then be added to the truth-set (in this case, the output is HG002_SVS.with-gnomad-TN.vcf.gz).
These calls will be in realistic locations, taken from real population calls rather than placed at random.
See below for links to some possible population call-sets.
Problem
This is a basic, known data-science problem, but it can still catch even seasoned analysts, and it is especially easy to hit in genomics.
Given a truth-set like the Genome in a Bottle SV set we can evaluate methods for SV detection and filtering.
In examples/svfilter.nim I have built a sophisticated classifier that will evaluate a set of SVs
and randomly retain (classify as valid) 96% of variants.
This method is able to achieve:
- 96% recall
- 82% precision

on the actual HG002 SV truth set.
How!!!?
Recall is (true-positives / (true-positives + false-negatives)). Since we classify 96% of variants as true, our recall must be 96%.
Precision is (true-positives / (true-positives + false-positives)). So why is precision so high?
We have 10757 negative variants and 48606 positive variants. Since the classifier retains variants at random, it keeps the same 96% of both groups and that factor cancels,
so we can expect the precision to be 48606 / (48606 + 10757), which gives us the 82%.
In short, because there are so few negatives, a random classifier (with a bias toward the positives) will appear to have decent or even great performance.
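The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not the code in examples/svfilter.nim; the function name `expected_metrics` and its parameters are made up for illustration, and only the two variant counts and the 96% retention rate come from the text.

```python
def expected_metrics(n_pos, n_neg, keep_rate):
    """Expected recall and precision of a classifier that keeps each variant
    with probability keep_rate, regardless of whether it is a true variant."""
    tp = keep_rate * n_pos          # positives retained
    fn = (1 - keep_rate) * n_pos    # positives discarded
    fp = keep_rate * n_neg          # negatives retained
    recall = tp / (tp + fn)         # equals keep_rate
    precision = tp / (tp + fp)      # keep_rate cancels: n_pos / (n_pos + n_neg)
    return recall, precision


# Counts quoted above for the HG002 SV truth set.
recall, precision = expected_metrics(n_pos=48606, n_neg=10757, keep_rate=0.96)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Note that the precision does not depend on `keep_rate` at all: it is fixed by the ratio of positives to negatives in the truth set.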
Mitigation
One way to make it harder to miss this problem is to add many true-negative variants. Instead of doing this randomly, we add true variants from a given population or database set.
Only population variants that are not within dist bases (default 100) of a variant in the truth-set are added.
This ensures that the added variants are realistic (not random locations).
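The distance rule can be sketched as follows. This is an illustration under stated assumptions, not tnsv's actual implementation (which is written in Nim): variants are reduced to hypothetical `(chrom, start, end)` tuples, "within dist bases" is interpreted as the gap between the two intervals being at most `dist`, and the function name `filter_population` is invented here.

```python
from collections import defaultdict


def filter_population(population, truth, dist=100):
    """Return the population intervals that are farther than `dist` bases
    from every truth-set interval on the same chromosome."""
    truth_by_chrom = defaultdict(list)
    for chrom, start, end in truth:
        truth_by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in population:
        # Padding the candidate by `dist` turns "within dist bases" into
        # a plain interval-overlap test against each truth interval.
        near_truth = any(
            start - dist <= t_end and t_start <= end + dist
            for t_start, t_end in truth_by_chrom.get(chrom, ())
        )
        if not near_truth:
            kept.append((chrom, start, end))
    return kept


truth = [("1", 1000, 2000)]
population = [("1", 2050, 2500), ("1", 2101, 2600), ("2", 1000, 2000)]
kept = filter_population(population, truth)
```

Here the first population interval starts 50 bases past the truth variant and is dropped, while the other two are kept. The linear scan per candidate is chosen for clarity; a real tool working on hundreds of thousands of calls would more likely use sorted intervals or an interval tree.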
For example, to add gnomad-sv calls to the Genome in a Bottle truth set, use:
```
# general form
tnsv $truthset $populationset > $augmentedtruthset
# concrete example
tnsv HG002_SVs_Tier1_v0.6.vcf.gz \
    nstd166.GRCh37.variant_call.vcf.gz \
    | bcftools sort -O z -o HG002_SVS.with-gnomad-TN.vcf.gz
```
Now we can re-try our random classifier with the following results:
- 96% Recall (which must be the case)
- 15% Precision (down from 82% on the original truth-set)
With this, the precision is so low that we should notice that something is wrong.
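As a rough back-of-the-envelope check, the same precision formula can be inverted to estimate how many negatives a truth set needs before a random classifier drops to a given precision. The README does not state how many variants were actually added, so the number computed here is only an estimate implied by the rounded 15% figure; the helper name `negatives_for_precision` is hypothetical.

```python
def negatives_for_precision(n_pos, precision):
    """Number of negatives N such that n_pos / (n_pos + N) == precision,
    i.e. the expected precision of a classifier that retains at random."""
    return n_pos * (1 - precision) / precision


n_pos = 48606  # positives in the HG002 truth set, as quoted above
n_neg_needed = negatives_for_precision(n_pos, 0.15)
print(f"about {n_neg_needed:,.0f} negatives give 15% precision")
```

Because 15% is a rounded value, the result should be read as an order of magnitude (a few hundred thousand negatives) rather than an exact count.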
True Negative sets:
tnsv will do simple re-mapping of chromosomes to match the truth-set, so that, for example, '22' in the population set can become 'chr22'
if 'chr22' is present in the truth-set.
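The chromosome re-mapping can be pictured with a small sketch. This is an assumption about the general idea, not tnsv's code: the function name `remap_chrom` is invented, the reverse direction ('chr22' to '22') is assumed by symmetry and is not confirmed by the README, and leaving unmatched names unchanged is likewise an illustrative choice.

```python
def remap_chrom(chrom, truth_chroms):
    """Rename a population-set chromosome so that it matches the naming
    used in the truth set, by adding or removing a 'chr' prefix."""
    if chrom in truth_chroms:
        return chrom  # already matches, nothing to do
    # Try the other naming convention ('22' <-> 'chr22').
    alternative = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    if alternative in truth_chroms:
        return alternative
    return chrom  # no match either way; left unchanged in this sketch


truth_chroms = {"chr1", "chr22"}
renamed = [remap_chrom(c, truth_chroms) for c in ["22", "chr1", "GL000207.1"]]
```

With a truth set that uses 'chr'-prefixed names, `renamed` contains 'chr22' and 'chr1', while the unplaced contig name passes through untouched.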
Owner
- Name: Brent Pedersen
- Login: brentp
- Kind: user
- Location: Oregon, USA
- Twitter: brent_p
- Repositories: 220
- Profile: https://github.com/brentp
Doing genomics
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0