vcfexpress

expressions on VCFs

https://github.com/brentp/vcfexpress

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary
Last synced: 6 months ago · JSON representation ·

Repository

expressions on VCFs

Basic Info
  • Host: GitHub
  • Owner: brentp
  • License: mit
  • Language: Rust
  • Default Branch: main
  • Size: 1.45 MB
Statistics
  • Stars: 85
  • Watchers: 2
  • Forks: 4
  • Open Issues: 0
  • Releases: 5
Created almost 2 years ago · Last pushed 11 months ago
Metadata Files
Readme Changelog License Citation

README.md

vcfexpress

[!CAUTION] While the output of vcfexpress is tested and reliable, the error messages might be lacking. Please report.

Rust DOI

This is an experiment on how to implement user-expressions that can filter (and modify) a VCF and specify an output template. It uses lua as the expression language. It is fast Because of the speed and flexibility, we can, for example implement CSQ parsing in lua, just as a user could. The resulting functionality is as fast or faster than other tools that have this built in.

For the optional output template, it uses luau string templates where luau is lua with some extensions and very good speed.

Installation

  • For rust users: cargo install vcfexpress
  • Otherwise see Releases for a static linux binary

Examples

Further examples are collected here and we encourage users to suggest helpful examples or snippets.

Short functionality examples --- extract a single variant and output a bed of the variant: ``` vcfexpress filter -e "return variant.id == 'rs2124717267'" \ --template '{variant.chrom}\t{variant.start}\t{variant.stop}' -o var.bed $vcf ``` --- filter based on INFO and write bcf: ``` vcfexpress filter -e "return variant:info('AN') > 3000" \ -o high_an.bcf $input_vcf ``` --- check the sample fields to get variants where `all` samples have high DP. `all` is defined by `vcfexpress` (`any`, `filter` are also available). Users can load their own functions with `-p $lua_file`. ``` vcfexpress filter \ -e 'return all(function (dp) return dp > 10 end, variant:format("DP"))' \ -o all-high-dp.bcf $input_vcf ``` --- Extract variants that are HIGH impact according to the `CSQ` field. This uses user-defind code to parse the CSQ field in scripts/csq.lua. ``` vcfexpress filter \ -e 'csqs = CSQS.new(variant:info("ANN"), desc); return csqs:any(function(c) return c["Annotation_Impact"] == "HIGH" end)' \ -o all-high-impact.bcf $input_vcf \ -p scripts/csq.lua -p scripts/pre.lua ``` --- get all of the FORMAT fields for a single sample into a lua table. find variant that are high-quality hom-alts. ``` vcfexpress filter \ -e 's=variant:sample("NA12878"); return s.DP > 10 and s.GQ > 20 and s.GT[1] == 1 and s.GT[2] == 1' \ -o output.bcf \ input.vcf ``` --- add a new info field (`af_copy`) and set it. ``` $ cat pre.lua header:add_info({ID="af_copy", Number=1, Description="adding a single field", Type="Float"}) ``` then run with: ``` vcfexpress filter -p pre.lua -e 'return variant:format("AD")[1][2] > 0' \ -s 'af_copy=return variant:info("AF", 0)' \ input.vcf > output.vcf ```

speed

see speed

Lua API

Full documentation of lua attributes and methods is here

```lua variant.chrom -> string variant.REF (get/set) -> string variant.ALT (get/set) -> vec variant.id (get/set) -> string variant.start -> integer variant.stop -> integer variant.pos (get/set) -> integer -- 0-based variant.qual (get/set) -> number variant.filters (get/set) -> vec variant.FILTER (get/set) -> string (only first one reported) variant.genotypes -> vec variant:format("fieldname") -> vec -- optional 0-based 2nd arg to info() gets just the desired index. variant:info("fieldname") -> number|string|bool|vec -- useful to pprint(variant:sample("mysample")) to see available fields. variant:sample("samplename") -> table -- get all samples at once, more efficient than calling sample() multiple times variant:samples() -> table<samplename=table> -- e.g. s = variant:samples(); s.NA12878.DP tostring(variant) -> string -- tab-delimited vcf/variant output.

genotypes = variant.genotypes genotype = genotypes[i] -- get single genotype for 1 sample tostring(genotype) -- e.g. "0/1" genotype.alts -- integer for number of non-zero, non-unknown alleles

allele = genotype[1] allele.phased -> bool allele.allele -> integer e.g. 0 for "0" allele

header.samples (set/get) -> vec -- TODO: allow setting samples before iteration. header:infoget("DP") -> table header:formatget("AD") -> table

-- these header:add* are available only in the prelude. currently only Number=1 is supported. header:addinfo({Type="Integer", Number=1, Description="asdf", ID="new field"}) header:addformat({Type="Integer", Number=1, Description="xyz", ID="new format field"}) header:addfilter({ID="LowQual", Description="Qual less than 50"})

sample = variant:sample("NA12878") sample.DP -- any fields in the row are available. special case for GT. use pprint to see structure: pprint(sample) --[[ { .GQ = 63, .DP = 23, .GT = { -- GT gives index into alt alles (or -1 for .) [1] = 0, [2] = 1}, .AD = { [1] = 23, [2] = 0}, .PL = { [1] = 0, [2] = 63, [3] = 945}, -- this is the genotype phase. so with GT, this is 0|1 .phase = { [1] = false, [2] = true}} --]] ```

Usage

``` Filter a VCF/BCF and optionally print by template expression. If no template is given the output will be VCF/BCF

Usage: vcfexpress filter [OPTIONS]

Arguments: Path to input VCF or BCF file

Options: -e, --expression boolean Lua expression(s) to filter the VCF or BCF file -s, --set-expression expression(s) to set existing INFO field(s) (new ones can be added in prelude) e.g. --set-expression "AFmax=math.max(variant:info('AF'), variant:info('AFx'))" -t, --template