https://github.com/brentp/slivar

genetic variant expressions, annotation, and filtering for great good.

Keywords

genomics rare-disease rare-variant-analysis variant-analysis variant-interpretation

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: brentp
License: mit
Language: Nim
Default Branch: master
Homepage:
Size: 3.26 MB

Statistics

Stars: 262
Watchers: 9
Forks: 24
Open Issues: 46
Releases: 34

Created about 7 years ago · Last pushed 7 months ago

Metadata Files

Readme Changelog License

slivar: filter/annotate variants in VCF/BCF format with simple expressions

If you use slivar, please cite the paper

slivar is a set of command-line tools that enables rapid querying and filtering of VCF files. It facilitates operations on trios and groups and allows arbitrary expressions using simple javascript.

use-cases for `slivar`

+ annotate variants with gnomad allele frequencies from combined exomes + whole genomes at > 30K variants/second using only a 1.5GB compressed annotation file.
+ call denovo variants with a simple expression that uses `mom`, `dad`, `kid` labels and is applied to each trio in a cohort (as inferred from a pedigree file): `kid.het && mom.hom_ref && dad.hom_ref && kid.DP > 10 && mom.DP > 10 && dad.DP > 10`
+ define and filter on arbitrary groups with labels. For example, 7 sets of samples each with 1 normal and 3 tumor time-points: `normal.AD[0] == 0 && tumor1.AB < tumor2.AB && tumor2.AB < tumor3.AB`
+ filter variants with simple expressions: `variant.call_rate > 0.9 && variant.FILTER == "PASS" && INFO.AC < 22 && variant.num_hom_alt == 0`
+ see using slivar for rare disease research

slivar has sub-commands:

+ expr: filter and/or annotate with INFO, trio, sample, group expressions
+ make-gnotate: make a compressed zip file of annotations for use by slivar
+ compound-hets: true compound hets using phase-by-inheritance within gene annotations

```
vcf=/path/to/your/vcf.vcf.gz
ped=/path/to/your/pedigree.ped
wget https://github.com/brentp/slivar/releases/download/v0.2.8/slivar
chmod +x ./slivar
wget https://raw.githubusercontent.com/brentp/slivar/master/js/slivar-functions.js
wget https://slivar.s3.amazonaws.com/gnomad.hg38.genomes.v3.fix.zip

# example command
./slivar expr --js slivar-functions.js -g gnomad.hg38.genomes.v3.fix.zip \
    --vcf $vcf --ped $ped \
    --info "INFO.gnomad_popmax_af < 0.01 && variant.FILTER == 'PASS'" \
    --trio "example_denovo:denovo(kid, dad, mom)" \
    --family-expr "denovo:fam.every(segregating_denovo)" \
    --trio "custom:kid.het && mom.het && dad.het && kid.GQ > 20 && mom.GQ > 20 && dad.GQ > 20" \
    --pass-only
```

The pedigree format is explained here

Commands

expr

expr allows filtering on (abstracted) trios and groups. For example, given a VCF (and ped/fam file) with 100 trios, slivar will apply an expression with kid, mom, dad identifiers to each trio that it automatically extracts.

expr can also be used, for example to annotate with population allele frequencies from a gnotate file without any sample filtering. See the wiki for more detail and the gnotate section for gnotation files that we distribute for slivar.

expr commands are quite fast, but can be parallelized using pslivar.

trio

When --trio is used, slivar finds all trios in a VCF, PED pair and lets the user specify an expression with identifiers of kid, mom, dad that is applied to each possible trio. For example, a simple expression to call de novo variants:

```javascript
variant.FILTER == 'PASS' &&
variant.call_rate > 0.95 &&                 // genotype must be known for most of cohort.
INFO.gnomad_af < 0.001 &&                   // rare in gnomad (must be in INFO [but see below])
kid.het && mom.hom_ref && dad.hom_ref &&    // expected genotypes (`unknown` is also available)
kid.DP > 7 && mom.DP > 7 && dad.DP > 7 &&   // sufficient depth in all
(mom.AD[1] + dad.AD[1]) == 0                // no evidence for alternate in the parents
```

This requires passing variants that are rare in gnomad that have the expected genotypes and do not have any alternate evidence in the parents. If there are 200 trios in the ped::vcf given, then this expression will be tested on each of those 200 trios.
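To make the logic of that expression easier to test outside slivar, here is a minimal sketch that wraps it in a plain function and runs it on hand-made mock objects. The function name `is_denovo` and the mock values are assumptions for illustration; only the attribute names (`FILTER`, `call_rate`, `gnomad_af`, `het`, `hom_ref`, `DP`, `AD`) come from the text above.

```javascript
// Hypothetical helper: the de novo expression from the text as a function.
// The arguments are mock objects, not real slivar objects.
function is_denovo(variant, INFO, kid, mom, dad) {
  return variant.FILTER == 'PASS' &&
    variant.call_rate > 0.95 &&
    INFO.gnomad_af < 0.001 &&
    kid.het && mom.hom_ref && dad.hom_ref &&
    kid.DP > 7 && mom.DP > 7 && dad.DP > 7 &&
    (mom.AD[1] + dad.AD[1]) == 0;
}

// Mock data for a single variant and a single trio.
var variant = { FILTER: 'PASS', call_rate: 0.99 };
var INFO = { gnomad_af: 0.0001 };
var kid = { het: true, hom_ref: false, DP: 30, AD: [15, 15] };
var mom = { het: false, hom_ref: true, DP: 28, AD: [28, 0] };
var dad = { het: false, hom_ref: true, DP: 25, AD: [24, 1] };

// dad carries one alternate read, so the last clause rejects the variant.
console.log(is_denovo(variant, INFO, kid, mom, dad)); // false

// With no alternate reads in either parent the variant passes.
dad.AD = [25, 0];
console.log(is_denovo(variant, INFO, kid, mom, dad)); // true
```

Within slivar itself the same body would be written directly in the `--trio` expression or placed in the file given to `--js`.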

When trios are not sufficient, use Family Expressions which allow more heterogeneous family structures.

The expressions are javascript so the user can make these as complex as needed.

```bash
# --pass-only : output only variants that pass one of the filters (default is to output all variants)
# --gnotate   : compressed zip that allows fast annotation so that `gnomad_af` is available in the expressions below.
# --js        : any valid javascript is allowed in this file. provide functions to be used below.
# --info      : this filter is applied before the trio filters and can speed evaluation if it is stringent.
slivar expr \
    --pass-only \
    --vcf $vcf \
    --ped $ped \
    --gnotate $gnomad_af.zip \
    --js js/slivar-functions.js \
    --out-vcf annotated.bcf \
    --info "variant.call_rate > 0.9" \
    --trio "denovo:kid.het && mom.hom_ref && dad.hom_ref \
        && kid.AB > 0.25 && kid.AB < 0.75 \
        && (mom.AD[1] + dad.AD[1]) == 0 \
        && kid.GQ >= 20 && mom.GQ >= 20 && dad.GQ >= 20 \
        && kid.DP >= 12 && mom.DP >= 12 && dad.DP >= 12" \
    --trio "informative:kid.GQ > 20 && dad.GQ > 20 && mom.GQ > 20 && kid.alts == 1 && \
        ((mom.alts == 1 && dad.alts == 0) || (mom.alts == 0 && dad.alts == 1))" \
    --trio "recessive:trio_autosomal_recessive(kid, mom, dad)"
```

Note that slivar does not give direct access to the genotypes. Instead it exposes hom_ref, het, hom_alt and unknown, or alts, where 0 is homozygous reference, 1 is heterozygous, 2 is homozygous alternate and -1 means the genotype is unknown. It is recommended to decompose a VCF before sending it to slivar.

Here it is assumed that trio_autosomal_recessive is defined in slivar-functions.js; an example implementation of that and other useful functions is provided here. Note that it's often better to use --family-expr instead as it's more flexible than trio expressions.
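As an illustration of the `alts` encoding and of what a trio function might look like, here is a small sketch. `make_sample` and `sketch_trio_recessive` are hypothetical names chosen for this example. This is not the `trio_autosomal_recessive` shipped in slivar-functions.js, which may apply different or additional checks.

```javascript
// Hypothetical mock sample built from the `alts` encoding:
// 0 = hom_ref, 1 = het, 2 = hom_alt, -1 = unknown.
function make_sample(alts, GQ) {
  return {
    alts: alts,
    hom_ref: alts === 0,
    het: alts === 1,
    hom_alt: alts === 2,
    unknown: alts === -1,
    GQ: GQ
  };
}

// Simplified recessive pattern (an assumption, not slivar's own function):
// kid homozygous alternate, both parents heterozygous, all well genotyped.
function sketch_trio_recessive(kid, mom, dad) {
  return kid.hom_alt && mom.het && dad.het &&
    kid.GQ >= 20 && mom.GQ >= 20 && dad.GQ >= 20;
}

var kid = make_sample(2, 40);
var mom = make_sample(1, 35);
var dad = make_sample(1, 30);
console.log(sketch_trio_recessive(kid, mom, dad)); // true

// An unknown genotype in one parent fails the pattern.
var dad_unknown = make_sample(-1, 0);
console.log(sketch_trio_recessive(kid, mom, dad_unknown)); // false
```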

Family Expressions

Trios are a nice abstraction for cohorts consisting of only trios, but for more general uses, there is --family-expr. For example, given either a duo or a quartet, we can find variants present only in affected samples with:

--family-expr "aff_only:fam.every(function(s) { return s.het == s.affected && s.hom_ref == !s.affected && s.GQ > 5 })"

Note that this does not explicitly check for transmission or non-transmission between parents and offspring, so it is less transparent than the trio mode, but more flexible.
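The callback in the `aff_only` example can be tried on a mock family to see how `fam.every` behaves. The sketch below assumes `fam` is an array of sample objects carrying the attributes used in the expression; the sample ids and values are made up.

```javascript
// The callback from the --family-expr example, as a named function.
function aff_only(s) {
  return s.het == s.affected && s.hom_ref == !s.affected && s.GQ > 5;
}

// Mock duo: an affected, heterozygous child and an unaffected, hom-ref parent.
var fam = [
  { id: 'child', affected: true, het: true, hom_ref: false, GQ: 40 },
  { id: 'parent', affected: false, het: false, hom_ref: true, GQ: 35 }
];
console.log(fam.every(aff_only)); // true

// If the unaffected parent is also heterozygous, the variant is no longer
// restricted to affected samples and the family fails the expression.
fam[1].het = true;
fam[1].hom_ref = false;
console.log(fam.every(aff_only)); // false
```

Because the callback only looks at one sample at a time, the same expression works unchanged for a duo, a trio or a quartet.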

Groups

A trio is a special-case of a group that can be inferred from a pedigree. For more specialized use-cases, a group can be specified. For example we could, instead of using --trio, use a group file like:

```
#kid	mom	dad
sample1	sample2	sample3
sample4	sample5	sample6
sample7	sample8	sample9
```

Here we have specified 3 trios below a header row that gives their "labels". The same can be accomplished using --trio, but a group file can also specify, for example, quartets like this:

```
#kid	mom	dad	sibling
sample1	sample2	sample3	sample10
sample4	sample5	sample6	sample11
sample7	sample8	sample9	sample12
```

where sample10 will be available as "sibling" in the first family, and an expression like `kid.alts == 1 && mom.alts == 0 && dad.alts == 0 && sibling.alts == 0` could be specified and would automatically be applied to each of the 3 families.
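To show how a header of labels maps onto each row, here is a simplified sketch of reading such a group file. `parse_groups` is a hypothetical helper, not slivar's parser: it assumes one sample id per column, whitespace-delimited fields and an optional leading `#` on the header line.

```javascript
// Hypothetical parser: turn a group file into one {label: sample_id} object per row.
function parse_groups(text) {
  var lines = text.split('\n').filter(function (line) {
    return line.trim() !== '';
  });
  // First non-empty line holds the labels; drop a leading '#' if present.
  var labels = lines[0].replace(/^#/, '').trim().split(/\s+/);
  return lines.slice(1).map(function (line) {
    var ids = line.trim().split(/\s+/);
    var group = {};
    labels.forEach(function (label, i) {
      group[label] = ids[i];
    });
    return group;
  });
}

var groups = parse_groups(
  '#kid\tmom\tdad\tsibling\n' +
  'sample1\tsample2\tsample3\tsample10\n' +
  'sample4\tsample5\tsample6\tsample11\n' +
  'sample7\tsample8\tsample9\tsample12\n'
);

console.log(groups.length);      // 3
console.log(groups[0].sibling);  // sample10
console.log(groups[2].kid);      // sample7
```

Each resulting object corresponds to one family, which is why a single expression written in terms of `kid`, `mom`, `dad` and `sibling` can be evaluated once per row.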

Another example could be looking at somatic variants with 3 samples, each with a normal and 4 time-points of a tumor:

```
#normal	tumor1	tumor2	tumor3	tumor4
ss1	ss8	ss9	ss10	ss11
ss2	ss12	ss13	ss14	ss15
ss3	ss16	ss17	ss18	ss19
```

where, again, each row is a sample and the IDs (starting with "ss") will be injected for each sample to allow a single expression like:

```bash
normal.hom_ref && normal.DP > 10 \
  && tumor1.AB > 0 \
  && tumor1.AB < tumor2.AB \
  && tumor2.AB < tumor3.AB \
  && tumor3.AB < tumor4.AB
```

to find a somatic variant that has increasing frequency (AB is allele balance) along the tumor time-points. More detail on groups is provided here
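The increasing-allele-balance test generalizes to any number of time-points. The sketch below is an illustration on mock objects; `increasing_ab` is a hypothetical helper and the AB values are invented.

```javascript
// Hypothetical helper: normal must be hom_ref with DP > 10, the first tumor
// must show some alternate signal, and AB must rise at every later time-point.
function increasing_ab(normal, tumors) {
  if (!(normal.hom_ref && normal.DP > 10)) return false;
  if (!(tumors[0].AB > 0)) return false;
  for (var i = 1; i < tumors.length; i++) {
    if (!(tumors[i - 1].AB < tumors[i].AB)) return false;
  }
  return true;
}

var normal = { hom_ref: true, DP: 35 };
var rising = [{ AB: 0.05 }, { AB: 0.12 }, { AB: 0.30 }, { AB: 0.45 }];
var flat = [{ AB: 0.05 }, { AB: 0.12 }, { AB: 0.12 }, { AB: 0.45 }];

console.log(increasing_ab(normal, rising)); // true
console.log(increasing_ab(normal, flat));   // false (tumor2 and tumor3 are equal)
```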

Sample Expressions

Users can specify a boolean expression that is tested against each sample using e.g.:

--sample-expr "hi_quality:sample.DP && sample.GQ > 10"

Each sample that passes this expression will have its sample id appended to the hi_quality INFO field, which is added to the output VCF.
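A rough sketch of that behavior follows: the expression is evaluated once per sample and the ids of passing samples are collected under the expression's name. The mock samples are invented, and joining the ids with commas is an assumption about the output format.

```javascript
// The sample expression from the example above, as a function.
function hi_quality(sample) {
  return Boolean(sample.DP && sample.GQ > 10);
}

// Mock samples for one variant.
var samples = [
  { id: 's1', DP: 20, GQ: 30 },
  { id: 's2', DP: 0, GQ: 50 },  // no depth: fails
  { id: 's3', DP: 15, GQ: 5 }   // low GQ: fails
];

// Collect the ids of passing samples into an INFO-like object.
var passing = samples.filter(hi_quality).map(function (s) { return s.id; });
var info = { hi_quality: passing.join(',') };

console.log(info.hi_quality); // s1
```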

make-gnotate

Users can make their own gnotate files like:

```bash
slivar make-gnotate --prefix gnomad \
    --field AF_popmax:gnomad_popmax_af \
    --field nhomalt:gnomad_num_homalt \
    gnomad.exomes.r2.1.sites.vcf.gz gnomad.genomes.r2.1.sites.vcf.gz
```

This will pull AF_popmax and nhomalt from the INFO field and put them into gnomad.zip as gnomad_popmax_af and gnomad_num_homalt respectively. The resulting zip file will contain the union of values seen in the exome and genomes files, with the maximum value for any intersection. Note that the names (gnomad_popmax_af and gnomad_num_homalt in this case) should be chosen carefully, as those will be the names added to the INFO of any file to be annotated with the resulting gnomad.zip.
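The "union, with the maximum for any intersection" rule can be pictured with two small lookup tables. This sketch only illustrates the merge rule; the variant keys and frequencies are invented, and `merge_max` is a hypothetical helper unrelated to the actual gnotate file format.

```javascript
// Hypothetical merge: keep every key from both inputs and,
// where a key appears in both, keep the larger value.
function merge_max(a, b) {
  var out = {};
  [a, b].forEach(function (src) {
    Object.keys(src).forEach(function (key) {
      if (!(key in out) || src[key] > out[key]) {
        out[key] = src[key];
      }
    });
  });
  return out;
}

// Made-up popmax allele frequencies keyed by chrom:pos:ref:alt.
var exomes = { '1:1000:A:G': 0.002, '1:2000:C:T': 0.01 };
var genomes = { '1:1000:A:G': 0.004, '1:3000:G:A': 0.0005 };

var merged = merge_max(exomes, genomes);
console.log(Object.keys(merged).length); // 3
console.log(merged['1:1000:A:G']);       // 0.004
```

Taking the maximum is the conservative choice for rare-variant filtering: a variant is treated as rare only if it is rare in every source.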

More information on make-gnotate is in the wiki

compound-het

This command is used to find compound heterozygous variants (with phasing-by-inheritance) in trios. It is used after filtering to rare(-ish) heterozygotes.

See a full description of use here

NOTE that by default, this command limits to a subset of impacts; this is adjustable with the --skip flag. See more on the wiki
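The idea of phasing by inheritance can be sketched as follows: two heterozygous variants in the same gene form a candidate compound het when one was inherited from the mother and the other from the father. This is a deliberately simplified illustration on mock data, not slivar's algorithm; it ignores de novo variants, impact filtering and larger pedigrees, and all names are hypothetical.

```javascript
// Which parent transmitted the kid's single alternate allele?
// Returns 'mom', 'dad', or null when it cannot be decided in this sketch.
function parent_of_origin(v) {
  if (v.kid.alts !== 1) return null;
  if (v.mom.alts === 1 && v.dad.alts === 0) return 'mom';
  if (v.mom.alts === 0 && v.dad.alts === 1) return 'dad';
  return null;
}

// Pair up variants in the same gene that come from different parents.
function compound_het_pairs(variants) {
  var pairs = [];
  for (var i = 0; i < variants.length; i++) {
    for (var j = i + 1; j < variants.length; j++) {
      var a = variants[i];
      var b = variants[j];
      if (a.gene !== b.gene) continue;
      var pa = parent_of_origin(a);
      var pb = parent_of_origin(b);
      if (pa !== null && pb !== null && pa !== pb) {
        pairs.push([a.id, b.id]);
      }
    }
  }
  return pairs;
}

var variants = [
  { id: 'v1', gene: 'GENE1', kid: { alts: 1 }, mom: { alts: 1 }, dad: { alts: 0 } },
  { id: 'v2', gene: 'GENE1', kid: { alts: 1 }, mom: { alts: 0 }, dad: { alts: 1 } },
  { id: 'v3', gene: 'GENE1', kid: { alts: 1 }, mom: { alts: 1 }, dad: { alts: 0 } },
  { id: 'v4', gene: 'GENE2', kid: { alts: 1 }, mom: { alts: 0 }, dad: { alts: 1 } }
];

console.log(JSON.stringify(compound_het_pairs(variants)));
// [["v1","v2"],["v2","v3"]]
```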

tsv

This command is used to convert a filtered and annotated VCF to a TSV (tab-separated value file) for final examination. An example use is:

```
slivar tsv -p $ped \
    -s denovo -s x_recessive \
    -c CSQ \
    -i gnomad_popmax_af -i gnomad_nhomalt \
    -g gene_desc.txt -g clinvar_gene_desc.txt \
    $vcf > final.tsv
```

where denovo and x_recessive indicate the INFO fields that contain lists of samples (as added by slivar) that should be extracted, and gnomad_popmax_af and gnomad_nhomalt are pulled from the INFO field. The -c argument (CSQ) tells slivar that it can get gene, transcript and impact information from the CSQ field in the INFO. The -g arguments are tab-delimited files of gene -> description, where the description is added to the text output for quick inspection. Run slivar tsv without any arguments for examples on how to create these for pLI and clinvar.

Also see the wiki

duo-del

slivar duo-del finds structural deletions in parent-child duos using non-transmission of alleles. This can find deletions in exome data using genotypes, thereby avoiding the problems associated with depth-based CNV calling in exomes.

see: https://github.com/brentp/slivar/wiki/finding-deletions-in-parent-child-duos

Data Driven Cutoffs

slivar ddc is a tool to discover data-driven cutoffs from a VCF and pedigree information. It generates an interactive VCF so a user can see how mendelian violations and transmissions are affected by varying cutoffs for values in the INFO and FORMAT fields.

See the wiki for more details.

Attributes

anything in the INFO is available as e.g. INFO.DP
INFO.impactful which, if CSQ (VEP), BCSQ (bcftools), or ANN (snpEff) is present, indicates whether the highest impact is "impactful" (see wiki). INFO.genic additionally includes other gene impacts like synonymous. INFO.highest_impact_order is explained in the wiki.
variant consequences such as those in INFO.CSQ can be parsed and used as objects as described here
if FORMAT.AB is not present, it is added so one can filter with kid.AB > 0.25 && kid.AB < 0.75
variant attributes are: CHROM, POS, start, end, ID, REF, ALT, QUAL, FILTER, is_multiallelic
calculated variant attributes include: aaf, hwe_score, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown
numeric and flag sample attributes from the FORMAT are available (via kid, mom, dad) as e.g. kid.AD[1], mom.DP, etc.
if the environment variable SLIVAR_FORMAT_STRINGS is not empty, then string sample fields will be available. These are not populated by default as they are used less often and impact performance.
sample attributes hom_ref, het, hom_alt, unknown, which are synonyms for sample.alts of 0, 1, 2, -1 respectively.
sample attributes from the ped for affected, phenotype, sex, id are available as, e.g. kid.sex. phenotype is a string taken directly from the pedigree file while affected is a boolean.
sample relations are available as mom, dad, kids. mom and dad will be undefined if not available and kids will be an empty array.
a VCF object contains CSQ, BCSQ, ANN if those are present in the header (from VEP, BCFTOOLS, SnpEFF). The content is a list indicating the order of entries in the field e.g. ["CONSEQUENCE", "CODONS","AMINO_ACIDS", "GENE", ...]
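Since the header-derived list gives the order of entries, a consequence string can be turned into objects by pairing each key with the value at the same position. The sketch below assumes the VEP convention of `|` between fields and `,` between transcripts; `parse_csq`, the shortened key list and the example string are made up for illustration and are not slivar's own CSQ helper.

```javascript
// Hypothetical parser: one object per transcript, keyed by the header order.
function parse_csq(csq_value, keys) {
  return csq_value.split(',').map(function (entry) {
    var values = entry.split('|');
    var obj = {};
    keys.forEach(function (key, i) {
      obj[key] = values[i] === undefined ? '' : values[i];
    });
    return obj;
  });
}

// Shortened key list in the style of the example above.
var keys = ['CONSEQUENCE', 'CODONS', 'AMINO_ACIDS', 'GENE'];
var csq = 'missense_variant|gAt/gGt|D/G|GENE1,synonymous_variant|gaC/gaT|D|GENE1';

var consequences = parse_csq(csq, keys);
console.log(consequences.length);          // 2
console.log(consequences[0].CONSEQUENCE);  // missense_variant
console.log(consequences[1].AMINO_ACIDS);  // D
```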

How it works

slivar embeds the duktape javascript engine to allow the user to specify expressions. For each variant, each trio (and each sample), it fills the appropriate attributes. This can be intensive for VCFs with many samples, but it is done as efficiently as possible, such that slivar can evaluate tens of thousands of variants per second even with dozens of trios.

Summary Table

slivar outputs a summary table with rows of samples and columns of expression where each value indicates the number of variants that passed the expression in each sample. By default, this goes to STDOUT but if the environment variable SLIVAR_SUMMARY_FILE is set, slivar will write the summary to that file instead.
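The shape of that table can be sketched as a count per sample and per expression. The record layout here (one object per variant, mapping an expression name to the ids of samples that passed it) is an assumption made for the example and `summarize` is a hypothetical helper, not slivar's implementation.

```javascript
// Hypothetical tally: rows are samples, columns are expression names.
function summarize(records, sample_ids, expr_names) {
  var table = {};
  sample_ids.forEach(function (id) {
    table[id] = {};
    expr_names.forEach(function (name) {
      table[id][name] = 0;
    });
  });
  records.forEach(function (rec) {
    expr_names.forEach(function (name) {
      (rec[name] || []).forEach(function (id) {
        if (table[id]) table[id][name] += 1;
      });
    });
  });
  return table;
}

// Three mock variants with the samples that passed each expression.
var records = [
  { denovo: ['kid1'] },
  { denovo: ['kid1', 'kid2'], recessive: ['kid2'] },
  { recessive: ['kid2'] }
];

var table = summarize(records, ['kid1', 'kid2'], ['denovo', 'recessive']);
console.log(table.kid1.denovo);    // 2
console.log(table.kid1.recessive); // 0
console.log(table.kid2.denovo);    // 1
console.log(table.kid2.recessive); // 2
```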

Gnotation Files

Users can create their own gnotation files with slivar make-gnotate, but we provide:

gnomad for hg37 with AF popmax, numhomalts (total and controls only) here
gnomad for hg38 (v3) genomes here
lifted gnomad exomes+genomes for hg38 with AF popmax, numhomalts (updated in release v0.1.2) here <!--
gnomad genomes (71,702 samples) for hg38 with AF popmax, numhomalts (updated in release v0.1.7) here -->
spliceai scores (maximum value of the 4 scores in spliceai) here
topmed allele frequencies (via dbsnp); these can be used with INFO.topmed_af. Useful when analyzing data in hg38 because some variants in hg38 are not visible in GRCh37

The available fields can be seen with, for example:

```
$ unzip -l gnomad.hg38.v2.zip | grep -oP "gnotate-[^.]+" | sort -u
gnotate-gnomad_nhomalt
gnotate-gnomad_nhomalt_controls
gnotate-gnomad_popmax_af
gnotate-gnomad_popmax_af_controls
gnotate-variant
```

indicating that INFO.gnomad_nhomalt, INFO.gnomad_nhomalt_controls, INFO.gnomad_popmax_af and INFO.gnomad_popmax_af_controls will be the fields after they are added to the INFO.

Owner

Name: Brent Pedersen
Login: brentp
Kind: user
Location: Oregon, USA

Twitter: brent_p
Repositories: 220
Profile: https://github.com/brentp

Doing genomics

GitHub Events

Total

Issues event: 7
Watch event: 13
Issue comment event: 17
Push event: 2
Pull request review event: 1
Fork event: 1

Last Year

Issues event: 7
Watch event: 13
Issue comment event: 17
Push event: 2
Pull request review event: 1
Fork event: 1

Issues and Pull Requests

All Time

Total issues: 93
Total pull requests: 11
Average time to close issues: about 1 month
Average time to close pull requests: about 13 hours
Total issue authors: 39
Total pull request authors: 3
Average comments per issue: 5.02
Average comments per pull request: 0.09
Merged pull requests: 9
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 0
Average time to close issues: about 3 hours
Average time to close pull requests: N/A
Issue authors: 3
Pull request authors: 0
Average comments per issue: 0.67
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

brentp (22)
nroak (10)
mvelinder (5)
sprakashUTH (4)
snashraf (3)
syouligan (3)
edg1983 (3)
liserjrqlxue (3)
williamrowell (3)
seboyden (3)
team-tomato-salad (2)
prasundutta87 (2)
srynobio (2)
cvlvxi (2)
weizhu365 (2)

Pull Request Authors

brentp (6)
brwnj (4)
jxchong (1)

Top Labels

Issue Labels

help wanted (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads: unknown

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 35

proxy.golang.org: github.com/brentp/slivar

Documentation: https://pkg.go.dev/github.com/brentp/slivar#section-documentation
License: mit
Latest release: v0.3.2
published 8 months ago

Versions: 35
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.3%

Average: 5.4%

Dependent repos count: 5.6%

Science Score: 36.0%
