vembrane

vembrane filters VCF records using python expressions

https://github.com/vembrane/vembrane

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
✓ Academic publication links: Links to: zenodo.org
✓ Committers with academic emails: 6 of 10 committers (60.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.5%) to scientific vocabulary

Keywords

bioinformatics filter filter-expression vcf vcf-filtering

Last synced: 11 months ago

Repository

vembrane filters VCF records using python expressions

Basic Info

Host: GitHub
Owner: vembrane
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 18.7 MB

Statistics

Stars: 61
Watchers: 6
Forks: 5
Open Issues: 14
Releases: 39

Topics

bioinformatics filter filter-expression vcf vcf-filtering

Created about 6 years ago · Last pushed 12 months ago

Metadata Files

Readme Changelog License Citation

vembrane: variant filtering using python expressions

vembrane lets you filter variants simultaneously on any INFO or FORMAT field, on CHROM, POS, ID, REF, ALT, QUAL and FILTER, and on the annotation field ANN. When filtering on ANN, annotation entries are filtered first; if no annotation entry remains, the entire variant is removed.

vembrane relies on pysam for reading/writing VCF/BCF files.

For a comparison with similar tools, see the vembrane benchmarks.

Installation

vembrane is available in bioconda. It can be installed into an existing conda environment with `mamba install -c conda-forge -c bioconda vembrane`, or into a new named environment with `mamba create -n environment_name -c conda-forge -c bioconda vembrane`. Alternatively, if you are familiar with git and uv, clone this repository and run `uv sync`. See docs/develop.md for further details.

Subcommands

vembrane provides several subcommands for different tasks:

filter: Filters VCF/BCF files based on flexible Python expressions. See the detailed documentation below. Example: `vembrane filter 'CHROM == "chr3" and ANN["Consequence"].any_is_a("frameshift_variant")' variants.bcf`
tag: A non-destructive version of filter. Instead of removing records, it adds a user-defined tag to the FILTER field of records that match an expression. For more details, see docs/tag.md. Example: `vembrane tag --tag quality_at_least_30="QUAL >= 30" variants.vcf`
table: Creates tabular (TSV) files from VCF/BCF data. The columns can be defined with flexible Python expressions. For more details, see docs/table.md. Example: `vembrane table 'CHROM, POS, 10**(-QUAL/10), ANN["CLIN_SIG"]' input.vcf > table.tsv`
annotate: Annotates VCF/BCF files with data from an external table-like file (e.g., TSV, CSV), based on genomic coordinates. For more details, see docs/annotate.md. Example: `vembrane annotate example.yaml example.bcf > annotated.vcf`
sort: Sorts VCF/BCF files. The sort order can be defined via flexible Python expressions and thus be based on any field and annotation that occurs in the VCF/BCF (e.g. impact or clinical significance). The primary use case is variant prioritization. For more details, see docs/sort.md. Example: `vembrane sort input.vcf 'round(ANN["gnomad_AF"], 1), -ANN["REVEL"]' > prioritized.vcf`
structured: Converts VCF records into structured formats like JSON, JSONL, or YAML using a flexible YTE template. For more details, see docs/structured.md. Example: `vembrane structured template.yml input.vcf --output output.json`
fhir: Converts VCF records into FHIR observations. For more details, see docs/fhir.md. Example: `vembrane fhir tumor GRCh38 --profile mii_molgen_v2025.0.0 --output-fmt json --annotation-key ANN < sample.vcf > sample-tumor.fhir.json`
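
The table example above uses the expression 10**(-QUAL/10). Assuming QUAL is a Phred-scaled quality, as the VCF specification defines it, this turns the score into an error probability. A minimal Python sketch of that arithmetic follows; the function name is hypothetical and not part of vembrane.

```python
def phred_to_error_probability(qual: float) -> float:
    """Convert a Phred-scaled quality into the probability that the call is wrong."""
    return 10 ** (-qual / 10)


# Higher QUAL means a smaller error probability:
# QUAL 10 is about 1 in 10, QUAL 30 is about 1 in 1000.
for qual in (10, 20, 30):
    print(qual, phred_to_error_probability(qual))
```

Because the conversion is a plain Python expression, it can be used directly as a column definition without any dedicated option.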

`vembrane filter`

Full documentation is also available in docs/filter.md.

Usage

vembrane takes two positional arguments: The filter expression and the input file; the latter may be omitted to read from stdin instead, making it easy to use vembrane in pipe chains.

```
usage: vembrane filter [options] expression [input vcf/bcf]

options:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        Output file. If not specified, output is written to
                        STDOUT.
  --output-fmt {vcf,bcf,uncompressed-bcf}, -O {vcf,bcf,uncompressed-bcf}
                        Output format.
  --annotation-key FIELDNAME, -k FIELDNAME
                        The INFO key for the annotation field. This defaults
                        to "ANN", but tools might use other field names. For
                        example, default VEP annotations can be parsed by
                        setting "CSQ" here.
  --aux NAME=PATH, -a NAME=PATH
                        Path to an auxiliary file containing a set of symbols.
  --ontology PATH       Path to an ontology in OBO format. The ontology is
                        loaded into memory and can be used in expressions via
                        the SO symbol. May be compressed with gzip, bzip2 or
                        xz. Defaults to built-in ontology (from
                        sequenceontology.org).
  --keep-unmatched      Keep all annotations of a variant if at least one of
                        them passes the expression (mimics SnpSift behaviour).
  --preserve-order      Ensures that the order of the output matches that of
                        the input. This is only useful if the input contains
                        breakends (BNDs) since the order of all other variants
                        is preserved anyway.
```

Filter expression

The filter expression can be any valid python expression that evaluates to a value of type bool. If you want to use truthy values, you need to wrap the expression in bool(), or aggregate multiple values via any() or all().

However, the functions and symbols available in expressions are restricted to the following:

all, any
abs, len, max, min, round, sum
enumerate, filter, iter, map, next, range, reversed, sorted, zip
dict, list, set, tuple
bool, chr, float, int, ord, str
Any function or symbol from math
Any function from statistics
Regular expressions via re
custom functions:
- without_na(values: Iterable[T]) -> Iterable[T] (keep only values that are not NA)
- replace_na(values: Iterable[T], replacement: T) -> Iterable[T] (replace values that are NA with some other fixed value)
- genotype related:
  - count_hom, count_het, count_any_ref, count_any_var, count_hom_ref, count_hom_var
  - is_hom, is_het, is_hom_ref, is_hom_var
  - has_ref, has_var
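
To illustrate the two rules above (the result must be a bool, and only a restricted set of names is available), here is a rough Python sketch of evaluating an expression against a whitelist. It is not vembrane's implementation: the helper `evaluate_filter`, the way fields are passed in as a dict, and the choice to raise TypeError for non-bool results are all assumptions made for the example. Note also that `eval` with emptied builtins is not a secure sandbox.

```python
import math
import re
import statistics

# Whitelisted builtins, mirroring the list above.
_BUILTINS = (
    all, any, abs, len, max, min, round, sum,
    enumerate, filter, iter, map, next, range, reversed, sorted, zip,
    dict, list, set, tuple, bool, chr, float, int, ord, str,
)

ALLOWED = {obj.__name__: obj for obj in _BUILTINS}
ALLOWED.update({k: v for k, v in vars(math).items() if not k.startswith("_")})
ALLOWED.update({name: getattr(statistics, name) for name in statistics.__all__})
ALLOWED["re"] = re


def evaluate_filter(expression: str, fields: dict) -> bool:
    """Evaluate `expression` with only whitelisted names plus the record's fields."""
    # A single namespace dict so that generator expressions can see the fields.
    env = {"__builtins__": {}, **ALLOWED, **fields}
    result = eval(expression, env)
    if not isinstance(result, bool):
        raise TypeError(
            f"expression must evaluate to bool, got {type(result).__name__}"
        )
    return result


# Hypothetical record fields, for illustration only.
record = {"CHROM": "chr3", "QUAL": 42.0, "ID": "rs11725853"}
print(evaluate_filter('CHROM == "chr3" and QUAL >= 30', record))
print(evaluate_filter('bool(re.search("^rs[0-9]+", ID or ""))', record))
```

In this sketch, `re.search("^rs[0-9]+", ID)` on its own returns a match object rather than a bool, which is why the expression is wrapped in `bool()`.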

Available fields

The following VCF fields can be accessed in the filter expression:

| Name | Type | Interpretation | Example expression |
| --- | --- | --- | --- |
| INFO | Dict[str, Any¹] | INFO field -> Value | INFO["DP"] > 0 |
| ANN² | Dict[str, Any³] | ANN field -> Value² | ANN["SYMBOL"] == "CDH2"² |
| CHROM | str | Chromosome Name | CHROM == "chr2" |
| POS | int | Chromosomal position (1-based) | 24 < POS < 42 |
| END | int | Chromosomal end position (1-based, inclusive, NA for breakends); also accessible via INFO["END"] | 24 < END < 42 |
| ID | str | Variant ID | ID == "rs11725853" |
| REF | str | Reference allele | REF == "A" |
| ALT | str | Alternative allele⁴ | ALT == "C" |
| QUAL | float | Quality | QUAL >= 60 |
| FILTER | List[str] | Filter tags | "PASS" in FILTER |
| FORMAT | Dict[str, Dict[str, Any¹]] | Format -> (Sample -> Value) | FORMAT["DP"][SAMPLES[0]] > 0 |
| SAMPLES | List[str] | [Sample] | "Tumor" in SAMPLES |
| INDEX | int | Index of variant in the file | INDEX < 10 |

¹ depends on type specified in VCF header

² if your VCF defines annotations under a key other than ANN (e.g. VEP's CSQ) you have to specify this via the --annotation-key flag (e.g. --annotation-key CSQ). You can (and should, for portability) still use ANN in your expressions then (although the given annotation key works as well, e.g. CSQ["SYMBOL"]).

³ for the usual snpeff and vep annotations, custom types have been specified; any unknown ANN field will simply be of type str. If something lacks a custom parser/type, please consider filing an issue in the issue tracker.

⁴ vembrane does not handle multi-allelic records itself. Instead, such files should be preprocessed with one of the following tools (preferably even before annotation):
- bcftools norm -m-any […]
- gatk LeftAlignAndTrimVariants […] --split-multi-allelics
- vcfmulti2oneallele […]

Examples

Only keep annotations and variants where gene equals "CDH2" and its impact is "HIGH": `vembrane filter 'ANN["SYMBOL"] == "CDH2" and ANN["Annotation_Impact"] == "HIGH"' variants.bcf`
Only keep variants with quality at least 30: `vembrane filter 'QUAL >= 30' variants.vcf`
Only keep annotations and variants where feature (transcript) is ENST00000307301: `vembrane filter 'ANN["Feature"] == "ENST00000307301"' variants.bcf`
Only keep annotations and variants where protein position is less than 10: `vembrane filter 'ANN["Protein_position"].start < 10' variants.bcf`
Only keep variants where the ID matches the regex pattern `^rs[0-9]+`: `vembrane filter 'bool(re.search("^rs[0-9]+", ID or ""))' variants.vcf`
Only keep variants where mapping quality is exactly 60: `vembrane filter 'INFO["MQ"] == 60' variants.bcf`
Only keep annotations and variants where CLIN_SIG contains "pathogenic", "likely_pathogenic" or "drug_response":
```sh
vembrane filter \
  'any(entry in ANN["CLIN_SIG"] for entry in ("pathogenic", "likely_pathogenic", "drug_response"))' \
  variants.vcf
```
Using set operations, the same may also be expressed as:
```sh
vembrane filter \
  'not {"pathogenic", "likely_pathogenic", "drug_response"}.isdisjoint(ANN["CLIN_SIG"])' \
  variants.vcf
```
Filter on sample specific values:
- by sample name: `vembrane filter 'FORMAT["DP"]["specific_sample_name"] > 0' variants.vcf`
- by sample index: `vembrane filter 'FORMAT["DP"][0] > 0' variants.vcf`
- by sample name based on the index in the list of SAMPLES: `vembrane filter 'FORMAT["DP"][SAMPLES[0]] > 0' variants.vcf`
- using all or a subset of SAMPLES: `vembrane filter 'mean(FORMAT["DP"][s] for s in SAMPLES) > 10' variants.vcf`
Filter on genotypes for specific samples (named "kid", "mom", "dad"): `vembrane filter 'is_het("kid") and is_hom_ref("mom") and is_hom_ref("dad") and all(FORMAT["DP"][s] > 10 for s in ["kid", "mom", "dad"])' variants.vcf`
Explicitly access the GT field for the first sample in the file: `vembrane filter 'FORMAT["GT"][0] == (1, 1)' variants.vcf`
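
The clinical-significance example above gives two formulations, one with `any()` and one with set operations. The following plain-Python sketch checks that they agree on a few inputs. It assumes that the annotation value behaves like a list of strings; the sample lists and function names are made up for illustration.

```python
from typing import Iterable

TARGETS = ("pathogenic", "likely_pathogenic", "drug_response")


def matches_with_any(clin_sig: Iterable[str]) -> bool:
    """Generator-based form: is at least one target present?"""
    values = list(clin_sig)
    return any(entry in values for entry in TARGETS)


def matches_with_sets(clin_sig: Iterable[str]) -> bool:
    """Set-based form: the targets and the values are not disjoint."""
    return not set(TARGETS).isdisjoint(clin_sig)


# Hypothetical annotation values.
examples = [
    ["benign"],
    ["benign", "pathogenic"],
    [],
    ["drug_response", "likely_pathogenic"],
]
for clin_sig in examples:
    assert matches_with_any(clin_sig) == matches_with_sets(clin_sig)
    print(clin_sig, matches_with_sets(clin_sig))
```

Both forms test for a non-empty intersection; which one to use is mostly a matter of readability.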

Custom `ANN` types

vembrane parses entries in the annotation field (ANN or whatever you specify under --annotation-key) as outlined in docs/ann_types.md.

Missing values in annotations

If a certain annotation field lacks a value, it will be replaced with the special value of NA. Comparing with this value will always result in False, e.g. ANN["MOTIF_POS"] > 0 will always evaluate to False if there was no value in the "MOTIF_POS" field of ANN (otherwise the comparison will be carried out with the usual semantics).

For fields with custom types, such as ANN["Protein_position"] which is of type PosRange with attributes start, end and length, trying to access ANN["Protein_position"].start will result in NA if there was no value for "Protein_position" in the annotation of the respective record, i.e. the access will return NA instead of raising an AttributeError. In general, any attribute access on NA will result in NA (and issue a warning to stderr).

Since you may want to use the regex module to search for matches, NA also acts as an empty str, such that re.search("nothing", NA) simply finds no match instead of raising an exception.

Explicitly handling missing/optional values in INFO or FORMAT fields can be done by checking for NA, e.g.: INFO["DP"] is NA.

Handling missing/optional values in fields other than INFO or FORMAT can be done by checking for None, e.g. ID is not None.

Sometimes, multi-valued fields may contain missing values; in this case, the without_na function can be convenient, for example: mean(without_na(FORMAT['DP'][s] for s in SAMPLES)) > 2.3. It is also possible to replace NA with some constant value with the replace_na function: mean(replace_na((FORMAT['DP'][s] for s in SAMPLES), 0.0)) > 2.3
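
The NA semantics described in this section can be pictured with a small stand-in. The sketch below is an illustrative re-implementation, not vembrane's code: the class name `NoValueSketch` is hypothetical, the two helpers only mirror the documented signatures of without_na and replace_na, and the warning on attribute access is left out.

```python
import re
from statistics import mean
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


class NoValueSketch(str):
    """Toy NA: an empty str whose comparisons are False and whose attributes are NA."""

    def __new__(cls) -> "NoValueSketch":
        return super().__new__(cls, "")

    def __getattr__(self, name: str) -> "NoValueSketch":
        # Only reached for attributes that str lacks, e.g. NA.start.
        if name.startswith("__"):
            raise AttributeError(name)
        return self

    def _always_false(self, other: object) -> bool:
        return False

    # Ordering and equality comparisons with NA all yield False.
    __lt__ = __le__ = __gt__ = __ge__ = __eq__ = _always_false
    __hash__ = str.__hash__


NA = NoValueSketch()


def without_na(values: Iterable[T]) -> Iterator[T]:
    """Keep only values that are not NA."""
    return (v for v in values if v is not NA)


def replace_na(values: Iterable[T], replacement: T) -> Iterator[T]:
    """Replace NA values with a fixed replacement."""
    return (replacement if v is NA else v for v in values)


depths = [4, NA, 2]  # hypothetical per-sample DP values
print(NA > 0)                          # comparison with NA is False
print(NA.start is NA)                  # attribute access on NA gives NA
print(re.search("nothing", NA))        # NA acts as an empty string
print(mean(without_na(depths)))        # mean over the two present values
print(mean(replace_na(depths, 0.0)))   # mean with NA counted as 0.0
```

The two helpers give different means on purpose: dropping NA averages over the present values only, while replacing NA keeps the number of samples fixed.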

Auxiliary files

vembrane supports auxiliary files, such as lists of genes or IDs, via the --aux NAME=path/to/file option. The file should contain one item per line and is parsed as a set. For example, `vembrane filter --aux genes=genes.txt "ANN['SYMBOL'] in AUX['genes']" variants.vcf` will keep only records where the annotated symbol is in the set specified in genes.txt.
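
As a rough picture of what "one item per line, parsed as a set" means, the following Python sketch loads such a file and performs the same membership test as the expression above. The helper `load_aux_set` is hypothetical, and stripping whitespace and skipping blank lines are assumptions, not documented vembrane behaviour.

```python
import tempfile
from pathlib import Path


def load_aux_set(path) -> set:
    """Read one item per line into a set (whitespace stripped, blank lines skipped)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


with tempfile.TemporaryDirectory() as tmp:
    genes_file = Path(tmp) / "genes.txt"
    genes_file.write_text("CDH2\nTP53\nBRCA1\n", encoding="utf-8")
    # Mirrors `--aux genes=genes.txt`: the name before "=" becomes the key.
    AUX = {"genes": load_aux_set(genes_file)}

annotation = {"SYMBOL": "CDH2"}  # hypothetical annotation entry
keep = annotation["SYMBOL"] in AUX["genes"]
print(keep)
```

Using a set keeps each membership test at roughly constant cost, which matters when the expression is evaluated once per annotation entry.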

Ontologies

vembrane supports ontologies in OBO format. The ontology is loaded into memory and can be accessed in the filter expression via the SO symbol. This enables filtering based on relationships between ontology terms. For example, vembrane filter --ontology so.obo 'ANN["Consequence"].any_is_a("intron_variant")' will keep only records where at least one of the consequences is an intron variant or a subtype thereof. If no ontology is provided, the built-in ontology from sequenceontology.org (date: 2024-06-06) is loaded automatically if the SO symbol is accessed.

There are three relevant classes/types:
- Term: Represents a term in the ontology. It inherits from str and can be used as such.
- Consequences: Represents a list of terms. It inherits from list and can be used as such.
- SO: Represents the ontology itself. It is a singleton and can be used to access the ontology.

The following functions are available for ontologies, where term is a single Term and terms is a Consequences object:
- SO.get_id(term: Term) -> str: Convert from term name (e.g. stop_gained) to accession (e.g. SO:0001587).
- SO.get_term(id_: str) -> Term: Convert from accession (e.g. SO:0001587) to term name (e.g. stop_gained).
- terms.most_specific_terms() -> Consequences: Narrow down the list of terms to the most specific ones, e.g. frameshift_variant&frameshift_truncation&intron_variant&splice_site_variant&splice_donor_5th_base_variant will lead to frameshift_truncation&splice_donor_5th_base_variant.
- term.ancestors() -> Consequences: Get all ancestral levels of a term, all the way to the ontology's root node.
- term.descendants() -> Consequences: Get all descendant levels of a term, all the way to the ontology's respective leaf nodes.
- term.parents() -> Consequences: Get immediate parents of a term.
- term.children() -> Consequences: Get immediate children of a term.
- term.is_a(parent: Term) -> bool: Check if there is a path from term to parent, i.e. whether term is the parent type or a subtype of it.
- terms.any_is_a(parent: Term) -> bool: Check if any of the terms is a subtype of parent.
- term.is_ancestor(other: Term) -> bool: Check if term is an ancestor of other.
- term.is_descendant(other: Term) -> bool: Check if term is a descendant of other. (Same as is_a)
- term.path_length(target: Term) -> int | None: Get the shortest path length from term to target or vice versa. Returns None if no path exists.
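
The relationship queries above can be sketched on a toy hierarchy. The parent map below is a simplified assumption chosen only to be consistent with the most_specific_terms example; it is not the real Sequence Ontology graph, and the free functions are illustrative stand-ins rather than vembrane's Term and Consequences methods.

```python
from typing import Dict, List, Set

# Toy child -> parents map (assumed, heavily simplified).
TOY_PARENTS: Dict[str, List[str]] = {
    "frameshift_truncation": ["frameshift_variant"],
    "splice_donor_5th_base_variant": ["splice_site_variant"],
    "splice_site_variant": ["intron_variant"],
    "frameshift_variant": [],
    "intron_variant": [],
}


def ancestors(term: str) -> Set[str]:
    """Collect all terms reachable by following parent links upwards."""
    seen: Set[str] = set()
    stack = list(TOY_PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(TOY_PARENTS.get(parent, []))
    return seen


def is_a(term: str, parent: str) -> bool:
    """True if term is the parent itself or one of its subtypes."""
    return term == parent or parent in ancestors(term)


def any_is_a(terms: List[str], parent: str) -> bool:
    return any(is_a(term, parent) for term in terms)


def most_specific_terms(terms: List[str]) -> List[str]:
    """Drop every term that is an ancestor of another term in the list."""
    redundant: Set[str] = set()
    for term in terms:
        redundant |= ancestors(term)
    return [term for term in terms if term not in redundant]


consequences = [
    "frameshift_variant",
    "frameshift_truncation",
    "intron_variant",
    "splice_site_variant",
    "splice_donor_5th_base_variant",
]
print(most_specific_terms(consequences))
print(any_is_a(["splice_donor_5th_base_variant"], "intron_variant"))
```

On this toy graph, the five terms reduce to the same two most specific terms as in the example above.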

Citation

Check the "Cite this repository" entry in the sidebar for citation options.

Also, please read should-I-cite-this-software for background.

Authors

Marcel Bargull (@mbargull)
Jan Forster (@jafors)
Till Hartmann (@tedil)
Johannes Köster (@johanneskoester)
Elias Kuthe (@eqt)
David Lähnemann (@dlaehnemann)
Felix Mölder (@felixmoelder)
Christopher Schröder (@christopher-schroeder)

Owner

Name: vembrane
Login: vembrane
Kind: organization

Repositories: 3
Profile: https://github.com/vembrane

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Hartmann"
  given-names: "Till"
  orcid: "https://orcid.org/0000-0002-6993-347X"
- family-names: "Schröder"
  given-names: "Christopher"
  orcid: "https://orcid.org/0000-0002-4483-8951"
- family-names: "Kuthe"
  given-names: "Elias"
  orcid: "https://orcid.org/0000-0002-4704-9536"
- family-names: "Lähnemann"
  given-names: "David"
  orcid: "https://orcid.org/0000-0002-9138-4112"
- family-names: "Köster"
  given-names: "Johannes"
  orcid: "https://orcid.org/0000-0001-9818-9320"
- family-names: "Bargull"
  given-names: "Marcel"
  orcid: "https://orcid.org/0000-0002-7185-3508"
- family-names: "Forster"
  given-names: "Jan"
  orcid: "https://orcid.org/0000-0002-1587-7732"
- family-names: "Mölder"
  given-names: "Felix"
  orcid: "https://orcid.org/0000-0002-3976-9701"
title: "vembrane"
version: 0.13.2
doi: 10.5281/zenodo.7244278
date-released: 2022-10-24
url: "https://github.com/vembrane/vembrane"
preferred-citation:
  type: article
  authors:
  - family-names: "Hartmann"
    given-names: "Till"
    orcid: "https://orcid.org/0000-0002-6993-347X"
  - family-names: "Schröder"
    given-names: "Christopher"
    orcid: "https://orcid.org/0000-0002-4483-8951"
  - family-names: "Kuthe"
    given-names: "Elias"
    orcid: "https://orcid.org/0000-0002-4704-9536"
  - family-names: "Lähnemann"
    given-names: "David"
    orcid: "https://orcid.org/0000-0002-9138-4112"
  - family-names: "Köster"
    given-names: "Johannes"
    orcid: "https://orcid.org/0000-0001-9818-9320"
  doi: "10.1093/bioinformatics/btac810"
  journal: "Bioinformatics"
  month: 12
  year: 2022
  title: "Insane in the vembrane: filtering and transforming VCF/BCF files"

GitHub Events

Total

Create event: 20
Release event: 2
Issues event: 12
Watch event: 4
Delete event: 14
Member event: 1
Issue comment event: 19
Push event: 110
Pull request review comment event: 61
Pull request review event: 49
Pull request event: 30

Last Year

Create event: 20
Release event: 2
Issues event: 12
Watch event: 4
Delete event: 14
Member event: 1
Issue comment event: 19
Push event: 110
Pull request review comment event: 61
Pull request review event: 49
Pull request event: 30

Committers

Last synced: over 3 years ago

All Time

Total Commits: 314
Total Committers: 10
Avg Commits per committer: 31.4
Development Distribution Score (DDS): 0.478

Top Committers

Name	Email	Commits
Till Hartmann	t**n@u**u	164
Christopher Schröder	c**r@t**e	77
Johannes Köster	j**r@u**e	28
jafors	f**n@w**e	22
Elias Kuthe	e**e@t**e	7
Marcel Bargull	m**l@u**m	6
David Laehnemann	d**n@h**e	4
Felix Mölder	f**r@u**e	3
Marcel Bargull	m**l@u**u	2
Jan Forster	3**s@u**m	1

Committer Domains (Top 20 + Academic)

uni-due.de: 2, tu-dortmund.de: 2, udo.edu: 2, hhu.de: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 24
Total pull requests: 148
Average time to close issues: 2 months
Average time to close pull requests: 15 days
Total issue authors: 12
Total pull request authors: 8
Average comments per issue: 2.17
Average comments per pull request: 0.94
Merged pull requests: 124
Bot issues: 0
Bot pull requests: 4

Past Year

Issues: 7
Pull requests: 36
Average time to close issues: 11 days
Average time to close pull requests: 13 days
Issue authors: 5
Pull request authors: 6
Average comments per issue: 1.29
Average comments per pull request: 0.47
Merged pull requests: 23
Bot issues: 0
Bot pull requests: 4


Top Authors

Issue Authors

tedil (5)
ptrebert (3)
christopher-schroeder (3)
johanneskoester (2)
dlaehnemann (2)
msierk (2)
sci-kai (2)
zihhuafang (1)
kokyriakidis (1)
jsquaredosquared (1)
dwpeng (1)

Pull Request Authors

tedil (93)
christopher-schroeder (16)
johanneskoester (12)
dlaehnemann (10)
mbargull (6)
EQt (6)
FelixMoelder (4)
github-actions[bot] (4)

Top Labels

Issue Labels

bug (2) maintenance (1)

Pull Request Labels

patch (15) enhancement (13) minor (13) maintenance (12) bug (5) major (3) documentation (2) autorelease: pending (2) help wanted (1)

Dependencies

poetry.lock pypi

atomicwrites 1.4.1 develop
attrs 21.4.0 develop
black 22.6.0 develop
cfgv 3.3.1 develop
click 8.1.3 develop
colorama 0.4.5 develop
distlib 0.3.4 develop
filelock 3.7.1 develop
identify 2.5.1 develop
iniconfig 1.1.1 develop
mypy-extensions 0.4.3 develop
nodeenv 1.7.0 develop
packaging 21.3 develop
pathspec 0.9.0 develop
platformdirs 2.5.2 develop
pluggy 1.0.0 develop
pre-commit 2.20.0 develop
py 1.11.0 develop
pyparsing 3.0.9 develop
pytest 7.1.2 develop
toml 0.10.2 develop
tomli 2.0.1 develop
typing-extensions 4.3.0 develop
virtualenv 20.15.1 develop
asttokens 2.0.5
intervaltree 3.1.0
numpy 1.23.1
pysam 0.19.1
pyyaml 6.0
six 1.16.0
sortedcontainers 2.4.0

pyproject.toml pypi

black ^22.6 develop
pre-commit ^2.20 develop
pytest ^7.1 develop
asttokens ^2.0
importlib_metadata ^1.7.0
intervaltree ^3.1
numpy ^1.23
pysam ^0.19
python >=3.8
pyyaml ^6.0

.github/workflows/main.yml actions

actions/checkout v2 composite
actions/setup-python v1 composite
julianwachholz/flake8-action v1 composite
snok/latest-python-versions v1 composite

.github/workflows/pypi.yml actions

JRubics/poetry-publish v1.8 composite
actions/checkout v2.3.4 composite

.github/workflows/release_drafter.yml actions

release-drafter/release-drafter v5 composite
