https://github.com/brentp/echtvar

using all the bits for echt rapid variant annotation and filtering

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 8 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary

Keywords

genetic-variants genomics variant-analysis variant-annotations

Keywords from Contributors

bioinformatics
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 153
  • Watchers: 5
  • Forks: 10
  • Open Issues: 9
  • Releases: 13
Topics
genetic-variants genomics variant-analysis variant-annotations
Created over 4 years ago · Last pushed 11 months ago
Metadata Files
Readme Changelog License

README.md

Echtvar: Really, truly rapid variant annotation and filtering

Rust

Echtvar efficiently encodes variant allele frequency and other information from huge population datasets to enable rapid (1M variants/second) annotation of genetic variants. It chunks the genome into blocks of 1<<20 (~1 million) bases and encodes each variant as a 32-bit integer (with a supplemental table for variants that can't fit due to large REF and/or ALT alleles). It uses the zip format, delta encoding and integer compression to create a compact and searchable format of any integer, float, or low-cardinality string columns selected from the population file.
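To make the encoding idea concrete, here is a minimal Python sketch of chunking, 32-bit packing and delta encoding. The bit layout (20 bits of offset, 2 bits each for the REF and ALT lengths, 8 bits of 2-bit bases) and the names `pack_variant` and `delta_encode` are assumptions made for illustration; the text above does not specify echtvar's actual layout.

```python
CHUNK_BITS = 20                      # 1 << 20 bases per chunk, as described above
CHUNK_SIZE = 1 << CHUNK_BITS
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_variant(pos, ref, alt):
    """Return (chunk_index, 32-bit code), or None if the variant does not fit.

    Hypothetical layout, high to low bits:
    20 bits offset in chunk | 2 bits len(ref) | 2 bits len(alt) | 8 bits of bases.
    """
    bases = ref + alt
    if not ref or not alt or len(ref) > 3 or len(alt) > 3 or len(bases) > 4:
        return None                  # would go to a supplemental table instead
    if any(b not in BASE_CODE for b in bases):
        return None
    packed = 0
    for b in bases:                  # 2 bits per base, at most 4 bases
        packed = (packed << 2) | BASE_CODE[b]
    offset = pos & (CHUNK_SIZE - 1)  # position within the chunk
    code = (offset << 12) | (len(ref) << 10) | (len(alt) << 8) | packed
    return pos >> CHUNK_BITS, code


def delta_encode(sorted_codes):
    """Store differences between consecutive sorted values (small numbers compress well)."""
    deltas, prev = [], 0
    for code in sorted_codes:
        deltas.append(code - prev)
        prev = code
    return deltas


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    codes, total = [], 0
    for d in deltas:
        total += d
        codes.append(total)
    return codes


codes = sorted(pack_variant(p, "A", "T")[1] for p in (1048577, 1048600, 1049000))
print(delta_encode(codes))
```

Because the variants in a chunk are sorted, the deltas are much smaller than the raw codes, which is what makes a subsequent integer compression step effective.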

Read more at the why of echtvar.

Getting started.

Get a static binary and pre-encoded echtvar files for gnomad v3.1.2 (hg38) here: https://github.com/brentp/echtvar/releases/latest. That page contains exact instructions to get started with the static binary.

Download or build instructions for linux

The linux binary is available via:

```
wget -O ~/bin/echtvar https://github.com/brentp/echtvar/releases/latest/download/echtvar \
    && chmod +x ~/bin/echtvar \
    && ~/bin/echtvar # show help
```

Users can make their own *echtvar* archives with `echtvar encode`, and pre-made archives for gnomAD version 3.1.2 are [here](https://github.com/brentp/echtvar/release)

Rust users can build on linux with:

```
cargo build --release --target x86_64-unknown-linux-gnu
```

Running echtvar with an existing archive (we have several available in releases) is as simple as:

    echtvar anno -e gnomad.echtvar.zip -e other.echtvar.zip input.vcf output.annotated.bcf

An optional filter that uses fields available in any of the zip files can be added like: `-i "gnomad_popmax_af < 0.01"`

echtvar can also accept input from stdin using "-" or "/dev/stdin" for the input argument.

usage

encode

Make (encode) a new echtvar file. This is usually done once (or the file is downloaded from those provided on the Releases page), and then the file can be re-used for the annotation (echtvar anno) step with each new query file. Note that input VCFs must be decomposed.

```
# conf.json defines the columns to pull from $input_vcf, and how to
# name and encode them.
# $input_population_vcf[s] can be split by chromosome or all in a single file.
echtvar \
   encode \
   gnomad.v3.1.2.echtvar.zip \
   conf.json \
   $input_population_vcf[s]
```

See below for a description of the json file that defines which columns are pulled from the population VCF.

annotate

Annotate a decomposed (and normalized) VCF with an echtvar file and only output variants where gnomad_popmax_af from the echtvar file is < 0.01. Note that multiple echtvar files can be specified, and that the -i expression is optional and can be omitted to output all variants.

    echtvar anno \
       -e gnomad.v3.1.2.echtvar.v2.zip \
       -e dbsnp.echtvar.zip \
       -i 'gnomad_popmax_af < 0.01' \
       $cohort.input.bcf \
       $cohort.echtvar-annotated.filtered.bcf
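When annotation is driven from a script, it can help to assemble the argument list programmatically. The following is a small Python sketch; the helper name `build_anno_command` is hypothetical, and it uses only the `anno`, `-e` and `-i` arguments shown above.

```python
import shlex


def build_anno_command(archives, input_path, output_path, include_expr=None):
    """Build the argv list for `echtvar anno` from the arguments shown above.

    `input_path` may be "-" or "/dev/stdin" to read the query VCF from stdin.
    """
    if not archives:
        raise ValueError("at least one echtvar archive is required")
    cmd = ["echtvar", "anno"]
    for archive in archives:
        cmd += ["-e", archive]           # one -e per echtvar archive
    if include_expr is not None:
        cmd += ["-i", include_expr]      # optional filter expression
    cmd += [input_path, output_path]
    return cmd


cmd = build_anno_command(
    ["gnomad.v3.1.2.echtvar.v2.zip", "dbsnp.echtvar.zip"],
    "cohort.input.bcf",
    "cohort.echtvar-annotated.filtered.bcf",
    include_expr="gnomad_popmax_af < 0.01",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To execute it: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with the `<` in the expression.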

Configuration File for Encode

When running echtvar encode, a json5 file (json with comments and other nice features) determines which columns are pulled from the input VCF and how they are stored.

A simple example is to pull a single integer field and give it a new name (alias):

[{"field": "AC", "alias": "gnomad_AC"}]

This will extract the "AC" field from the INFO and label it as "gnomad_AC" when later used to annotate a VCF. Note that it's important to give a descriptive, unique prefix like "`gnomad_`" so as not to collide with fields already in the query VCF.

Detail on additional fields, including float and string types:

```
[
    {"field": "AC", "alias": "gnomad_AC"},
    // this JSON file is json 5 and so can have comments
    // the missing value will default to -1, but the value: -2147483648 will
    // result in '.' as it is the missing value for VCF.
    {"field": "AN", "alias": "gnomad_AN", missing_value: -2147483648},

    {
        field: "AF",
        alias: "gnomad_AF",
        missing_value: -1,
        // since all values (including floats) are stored as integers, echtvar internally converts
        // any float to an integer by multiplying by `multiplier`.
        // higher values give better precision and worse compression.
        // upon annotation, the score is divided by multiplier to give a number close to the original float.
        multiplier: 2000000,
        // set zigzag to true if your data has negative values
        zigzag: true,
    },

    // echtvar will save strings as integers along with a lookup. this can work for fields with a low cardinality.
    {"field": "string_field", "alias": "gnomad_string_field", missing_string: "UNKNOWN"},
    // "FILTER" is a special case that indicates that echtvar should extract the FILTER column from the annotation vcf.
    {"field": "FILTER", "alias": "gnomad_filter"},
]
```

The above file will extract 5 fields, but the user can choose as many as they like when encoding. All fields in an `echtvar` file will be added (with the given alias) to any VCF it is used to annotate.
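The `multiplier`, `zigzag` and string-lookup options can be illustrated with a short Python sketch. This is not echtvar's code: the function names are hypothetical, and rounding to the nearest integer (rather than truncating) is an assumption.

```python
def float_to_int(value, multiplier):
    """Scale a float by `multiplier` and round to an integer (rounding is assumed)."""
    return round(value * multiplier)


def int_to_float(stored, multiplier):
    """Recover an approximation of the original float on annotation."""
    return stored / multiplier


def zigzag_encode(n):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def zigzag_decode(z):
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def encode_strings(values, missing_string="UNKNOWN"):
    """Replace low-cardinality strings by integer codes plus a lookup list."""
    lookup, index, codes = [], {}, []
    for v in values:
        if v is None:
            v = missing_string           # missing entries share one code
        if v not in index:
            index[v] = len(lookup)
            lookup.append(v)
        codes.append(index[v])
    return codes, lookup


multiplier = 2000000
stored = float_to_int(0.0123456, multiplier)
print(stored, int_to_float(stored, multiplier))
print(encode_strings(["PASS", "PASS", None, "AC0", "PASS"]))
```

With a multiplier of 2000000 the round trip is accurate to about 1/2000000, which shows the trade-off named in the config comments: a larger multiplier preserves more digits but produces larger integers that compress less well.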

Other examples are available here

And full examples are in the wiki

Expressions

An optional expression will determine which variants are written. It can utilize any (and only) integer or float fields present in the echtvar file (not those present in the query VCF). An example could be:

-i 'gnomad_af < 0.01 && gnomad_nhomalts < 10'

The expressions are enabled by fasteval with supported syntax detailed here.

In brief, the normal operators (&&, ||, +, -, *, /, <, <=, >, >=) and groupings with ( and ) are supported and can be used to craft an expression that returns true or false as above.
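The way such an expression acts as a per-variant predicate can be sketched in Python. This is only an illustration of the semantics, not fasteval: the helper `passes_filter` is hypothetical, it translates `&&` and `||` to Python's `and` and `or`, and it relies on `eval`, so it should not be used on untrusted input.

```python
def passes_filter(expression, fields):
    """Evaluate a filter expression against one variant's annotation values.

    `fields` maps echtvar field names (aliases) to integer or float values.
    """
    py_expr = expression.replace("&&", " and ").replace("||", " or ")
    # Only the annotation fields are visible to the expression.
    return bool(eval(py_expr, {"__builtins__": {}}, dict(fields)))


expression = "gnomad_af < 0.01 && gnomad_nhomalts < 10"
variants = [
    {"gnomad_af": 0.001, "gnomad_nhomalts": 3},    # rare, few homozygotes
    {"gnomad_af": 0.25, "gnomad_nhomalts": 3},     # too common
    {"gnomad_af": 0.001, "gnomad_nhomalts": 50},   # too many homozygotes
]
kept = [v for v in variants if passes_filter(expression, v)]
print(len(kept), "of", len(variants), "variants written")
```

Only variants for which the expression evaluates to true are kept, mirroring how `-i` restricts the output.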

References and Acknowledgements

Without these (and other) critical libraries, echtvar would not exist.

echtvar is developed in the Jeroen De Ridder lab

Owner

  • Name: Brent Pedersen
  • Login: brentp
  • Kind: user
  • Location: Oregon, USA

Doing genomics

GitHub Events

Total
  • Create event: 2
  • Release event: 2
  • Issues event: 8
  • Watch event: 13
  • Issue comment event: 19
  • Push event: 3
Last Year
  • Create event: 2
  • Release event: 2
  • Issues event: 8
  • Watch event: 13
  • Issue comment event: 19
  • Push event: 3

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 179
  • Total Committers: 8
  • Avg Commits per committer: 22.375
  • Development Distribution Score (DDS): 0.078
Past Year
  • Commits: 4
  • Committers: 2
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.25
Top Committers
Name             Email           Commits
Brent Pedersen   b****e@g****m   165
Seth Stadick     s****k@g****m   4
Miller           m****5@e****u   3
Miguel Brown     m****n@g****m   2
Dan Gealow       d****w@p****m   2
Wen-Wei Liao     w****o@w****u   1
JakeHagen        J****n          1
Eric Normandeau  e****c@g****m   1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 34
  • Total pull requests: 14
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 20
  • Total pull request authors: 8
  • Average comments per issue: 2.94
  • Average comments per pull request: 2.5
  • Merged pull requests: 13
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 7
  • Pull requests: 0
  • Average time to close issues: about 12 hours
  • Average time to close pull requests: N/A
  • Issue authors: 6
  • Pull request authors: 0
  • Average comments per issue: 1.14
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • brentp (6)
  • m-pauper (3)
  • MattWellie (2)
  • JakeHagen (2)
  • liserjrqlxue (2)
  • edg1983 (2)
  • sstadick (2)
  • dvg-p4 (2)
  • migbro (2)
  • cliff7100 (1)
  • ruslan-abasov (1)
  • davetang (1)
  • fellen31 (1)
  • dnil (1)
  • cheanney (1)
Pull Request Authors
  • migbro (6)
  • dmiller15 (4)
  • dvg-p4 (3)
  • sstadick (2)
  • wwliao (1)
  • JakeHagen (1)
  • brentp (1)
  • enormandeau (1)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Dependencies

.github/workflows/build.yml actions
  • Shogan/rust-musl-action v1.0.2 composite
  • Spikatrix/upload-release-action b713c4b73f0a8ddda515820c124efc6538685492 composite
  • actions/checkout v2 composite
.github/workflows/ci.yml actions
  • actions/checkout v2 composite
Cargo.toml cargo
paper/Dockerfile docker
  • ubuntu 20.04 build