vcf-reformatter

🧬 High-performance VCF file parser and reformatter with VEP annotation support. Converts complex VCF files to analyzable TSV format with intelligent transcript handling.

https://github.com/flalom/vcf-reformatter

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓
    CITATION.cff file
    Found CITATION.cff file
  • ✓
    codemeta.json file
    Found codemeta.json file
  • ✓
    .zenodo.json file
    Found .zenodo.json file
  • ○
    DOI references
  • ○
    Academic publication links
  • ○
    Academic email domains
  • ○
    Institutional organization owner
  • ○
    JOSS paper metadata
  • ○
    Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

bioinformatics computational-biology file-format-conversion genomi hpc research-tool rust variant-calling
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 36
  • Watchers: 1
  • Forks: 3
  • Open Issues: 0
  • Releases: 3
Topics
bioinformatics computational-biology file-format-conversion genomi hpc research-tool rust variant-calling
Created 8 months ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

VCF Reformatter: What is it?

Have you ever had VCF files and wanted to look at the data as you would with a normal table? VCF Reformatter is here to the rescue!

A Rust command-line tool for parsing and reformatting VCF (Variant Call Format) files, with support for VEP (Variant Effect Predictor) and SnpEff annotations. This tool flattens complex VCF files into tab-separated values (TSV) format for easier downstream analysis. It is also very useful for quick checks of your data!

VCF Reformatter

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Rust](https://img.shields.io/badge/rust-1.70+-blue.svg)](https://www.rust-lang.org)
[![Build Status](https://img.shields.io/badge/build-passing-brightgreen.svg)]()
[![Performance](https://img.shields.io/badge/performance-10k--30k%20variants%2Fsec-green.svg)]()
[![Release](https://img.shields.io/github/v/release/flalom/vcf-reformatter)](https://github.com/flalom/vcf-reformatter/releases)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-purple.svg?style=flat)](https://anaconda.org/bioconda/vcf-reformatter)
[![Conda](https://anaconda.org/bioconda/vcf-reformatter/badges/version.svg)](https://anaconda.org/bioconda/vcf-reformatter)

**Transform complex VCF files into clean, analyzable tables with ease**

*A high-performance Rust tool for flattening VCF files with intelligent VEP and SnpEff annotation handling*

🚀 Quick Start

```bash
# Download binary from releases (easiest! You download and use it)
wget https://github.com/flalom/vcf-reformatter/releases/latest/download/vcf-reformatter-v0.3.0-linux-x86_64
chmod +x vcf-reformatter-v0.3.0-linux-x86_64

# Transform your VCF file
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz

# Generate MAF output ⚠️ (in beta!)
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz --output-format maf
```

OR via Bioconda:

```bash
conda install -c bioconda vcf-reformatter
# or
mamba install vcf-reformatter -c bioconda
```

OR install from [crates.io](https://crates.io/crates/vcf-reformatter):

```bash
cargo install vcf-reformatter
```

OR build from source (you need the Rust toolchain):

```bash
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release
./target/release/vcf-reformatter sample.vcf.gz
```

⚠️ Experimental MAF support

MAF output is currently in beta testing (v0.3.0). Known limitations:

  • VAF calculation needs refinement for some genotype patterns
  • Multi-sample handling requires validation
  • Use with caution in production workflows

Memory considerations for MAF:

  • Files >100K variants: Monitor memory usage
  • Files >1M variants: Ensure adequate RAM (16GB+)
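To make the VAF caveat concrete, here is a minimal sketch that assumes VAF is computed as alternate reads over total reads from a sample's `AD` (allelic depth) value. `vaf_from_ad` is a hypothetical helper written for illustration, not the tool's implementation; the `None` branches mark the genotype patterns (missing values, zero depth, multi-allelic sites) where a calculation like this tends to need care.

```python
from typing import Optional


def vaf_from_ad(ad: str, alt_index: int = 1) -> Optional[float]:
    """Return alt_reads / total_reads for an AD string such as "10,30".

    Hypothetical helper: index 0 is the reference allele, index 1+ are ALTs.
    """
    parts = ad.split(",")
    # "." marks a missing value in VCF; an out-of-range index has no depth.
    if "." in parts or alt_index >= len(parts):
        return None
    try:
        depths = [int(p) for p in parts]
    except ValueError:
        return None
    total = sum(depths)
    if total == 0:
        return None  # avoid dividing by zero when no reads cover the site
    return depths[alt_index] / total


print(vaf_from_ad("10,30"))
```

For multi-allelic sites, `alt_index` selects which alternate allele the fraction refers to.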

🎯 Why VCF Reformatter?

The Problem: VCF files are notoriously difficult to analyze. Complex nested annotations, semicolon-separated INFO fields, and multi-transcript VEP annotations make downstream analysis a nightmare.

The Solution: VCF Reformatter flattens everything into clean, readable TSV format that works seamlessly with Excel, R, Python, and any analysis tool (⚠️ beware Excel auto-correction!).

Before & After

Before (Raw VCF):

    chr1  69511  .  A  G  1294.53  .  DP=65;AF=1;CSQ=G|missense_variant|MODERATE|OR4F5|ENSG00000186092...

After (Reformatted TSV):

    CHROM  POS    REF  ALT  QUAL     INFO_DP  INFO_AF  CSQ_Allele  CSQ_Consequence   CSQ_SYMBOL
    chr1   69511  A    G    1294.53  65       1        G           missense_variant  OR4F5
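The flattening idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's Rust implementation: `flatten_info` and `CSQ_FIELDS` are hypothetical names, the CSQ field order is hard-coded here (in a real VEP file it is declared in the CSQ header line), and only the first transcript is used.

```python
def flatten_info(info: str, csq_fields: list[str]) -> dict[str, str]:
    """Flatten an INFO string into INFO_* and CSQ_* columns (first transcript only)."""
    row: dict[str, str] = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "CSQ":
            # Transcripts are comma-separated; fields within one are pipe-separated.
            first_transcript = value.split(",")[0]
            for name, field in zip(csq_fields, first_transcript.split("|")):
                row[f"CSQ_{name}"] = field
        else:
            # Flag-style INFO entries have no "=value" part; "true" is a
            # representation chosen for this sketch only.
            row[f"INFO_{key}"] = value if value else "true"
    return row


# Assumed field order for illustration.
CSQ_FIELDS = ["Allele", "Consequence", "IMPACT", "SYMBOL", "Gene"]
info = "DP=65;AF=1;CSQ=G|missense_variant|MODERATE|OR4F5|ENSG00000186092"
print(flatten_info(info, CSQ_FIELDS))
```

Each key of the returned dictionary corresponds to one column of the reformatted table.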

✨ Key Features

| Feature | Description | Benefit |
|---------|-------------|---------|
| 🧬 VEP/SnpEff Annotation Parsing | Intelligent handling of CSQ/ANN annotations | No more manual parsing of complex VEP/SnpEff output |
| 👀 Automatic Annotation Recognition | Automatic detection of CSQ/ANN annotations | Saving even more time now for both VEP and SnpEff |
| 🔀 Smart Transcript Handling | Most severe, first only, or split transcripts | Choose the analysis approach that fits your needs |
| 🚀 Parallel Processing | Multi-threaded processing up to 30k variants/sec | Process large cohorts in minutes, not hours |
| 📁 Native Compression | Direct .vcf.gz reading & gzip output | Seamless workflow with compressed/uncompressed files |
| 🎯 Production Ready | Comprehensive error handling & logging | Reliable for automated pipelines |
| 🐳 Container Support | Docker & Singularity ready | Deploy anywhere, from laptops to HPC clusters |


📦 Installation

Option 1: Download Pre-compiled Binaries (Easiest!)

No Rust installation required - just download and run:

  1. Go to Releases
  2. Download the binary for your platform:

    • vcf-reformatter-v0.3.0-linux-x86_64 → Linux (most users)
    • vcf-reformatter-v0.3.0-linux-x86_64-static → HPC clusters (works everywhere)
    • vcf-reformatter-v0.3.0-windows-x86_64.exe → Windows
    • vcf-reformatter-v0.3.0-macos-x86_64 → Intel Mac
    • vcf-reformatter-v0.3.0-macos-arm64 → Apple Silicon Mac (M1/M2/M3/M4)
  3. Make executable and run:

```bash
# Linux/Mac
chmod +x vcf-reformatter-*
./vcf-reformatter-* --help

# Windows
# Just double-click or run from command prompt
# C++ might be required, if not already installed
```

Option 2: Build from Source

```bash
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release
```

Option 3: Docker

```shell script
# Build the container
docker build -t vcf-reformatter .

# Run with your data
docker run --rm -v $(pwd):/data vcf-reformatter /data/sample.vcf.gz
```

Option 4: Singularity

```shell script
# Build Singularity image
singularity build vcf-reformatter.sif Singularity

# Run on HPC cluster
singularity run --bind $PWD:/data vcf-reformatter.sif /data/sample.vcf.gz -j 16
```

🛠️ Usage

Basic Usage

```shell script
# Simple conversion
vcf-reformatter input.vcf.gz

# Most severe consequence only (recommended for analysis)
vcf-reformatter input.vcf.gz -t most-severe

# All transcripts in separate rows (comprehensive)
vcf-reformatter input.vcf.gz -t split
```

Annotation Type Detection

```shell script
# Auto-detect annotation type (recommended)
vcf-reformatter input.vcf.gz -a auto

# Force VEP processing
vcf-reformatter vep_annotated.vcf.gz -a vep -t most-severe

# Force SnpEff processing
vcf-reformatter snpeff_annotated.vcf.gz -a snpeff -t most-severe
```
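One plausible way to auto-detect the annotation type is to look at the INFO declarations in the VCF header: VEP conventionally writes its annotations to a `CSQ` field and SnpEff to an `ANN` field. The sketch below assumes that convention; `detect_annotation_type` is a hypothetical function, and the tool's actual detection logic may differ.

```python
from typing import Iterable, Optional


def detect_annotation_type(header_lines: Iterable[str]) -> Optional[str]:
    """Return "vep", "snpeff", or None based on INFO declarations in the header."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ,"):
            return "vep"
        if line.startswith("##INFO=<ID=ANN,"):
            return "snpeff"
    return None


# Shortened example header lines, written for this sketch.
vep_header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Format: Allele|Consequence|IMPACT|SYMBOL|Gene">',
]
snpeff_header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations">',
]
print(detect_annotation_type(vep_header), detect_annotation_type(snpeff_header))
```

When a file carries neither field the function returns `None`, which is the case where passing `-a vep` or `-a snpeff` explicitly would not help either.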

Advanced Usage

```shell script
# High-performance processing with compression
vcf-reformatter large_cohort.vcf.gz \
  --transcript-handling most-severe \
  --threads 0 \
  --compress \
  --output-dir results/ \
  --prefix my_analysis \
  --verbose

# Optimized for HPC environments
vcf-reformatter huge_dataset.vcf.gz -t most-severe -j 32 -o /scratch/results/ -c -v
```

Complete Options

```
Usage: vcf-reformatter [OPTIONS]

Arguments:
  Input VCF file (supports .vcf.gz)

Options:
      --output-format          Output format [default: tsv] [values: tsv, maf]
      --center                 Sequencing center for MAF output
      --ncbi-build             Genome build [default: GRCh38]
      --sample-barcode         Sample identifier for MAF output
  -t, --transcript-handling    How to handle multiple transcripts [default: first] [values: most-severe, first, split]
  -a, --annotation-type        Which annotations to parse VEP/SnpEff [default: auto] [values: snpeff, vep, auto]
  -j, --threads                Thread count (0 = auto-detect) [default: 1]
  -o, --output-dir             Output directory [default: current]
  -p, --prefix                 Output file prefix [default: input filename]
  -c, --compress               Compress output with gzip
  -v, --verbose                Detailed performance statistics
  -h, --help                   Show help
  -V, --version                Show version
```

🧬 Transcript Handling Modes

VCF files with VEP annotations often contain multiple transcript annotations per variant. Choose the strategy that fits your analysis:

🎯 Most Severe (--transcript-handling most-severe)

Best for: Clinical analysis, variant prioritization

```shell script
vcf-reformatter input.vcf.gz -t most-severe

# for maf output
vcf-reformatter input.vcf.gz -t most-severe --output-format maf
```

Selects the transcript with the most severe consequence (stop_gained > missense_variant > synonymous, etc.)

⚡ First Only (--transcript-handling first) [Default]

Best for: Quick analysis, performance-critical workflows

```shell script
vcf-reformatter input.vcf.gz # Uses first transcript by default
```

Processes only the first transcript annotation (fastest option)

📊 Split All (--transcript-handling split)

Best for: Comprehensive analysis, transcript-level studies

```shell script
vcf-reformatter input.vcf.gz -t split
```

Creates separate rows for each transcript (most detailed output)
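The three modes can be sketched as a single selection function. This is an illustration only: `SEVERITY` is a deliberately tiny, assumed ranking (the tool's full ordering is not listed in this README), and `rank` and `select_transcripts` are hypothetical names.

```python
# Assumed severity order, most severe first; a small subset for illustration.
SEVERITY = ["stop_gained", "missense_variant", "synonymous_variant"]


def rank(consequence: str) -> int:
    """Lower is more severe; unknown terms sort after every known one."""
    # VEP joins multiple consequence terms for one transcript with "&".
    ranks = [SEVERITY.index(t) for t in consequence.split("&") if t in SEVERITY]
    return min(ranks) if ranks else len(SEVERITY)


def select_transcripts(transcripts: list[dict], mode: str = "first") -> list[dict]:
    """Return the transcript annotations that would become output rows."""
    if not transcripts:
        return []
    if mode == "first":
        return transcripts[:1]
    if mode == "most-severe":
        # min() keeps the earliest transcript when severities tie.
        return [min(transcripts, key=lambda t: rank(t["Consequence"]))]
    if mode == "split":
        return list(transcripts)
    raise ValueError(f"unknown mode: {mode}")


transcripts = [
    {"Feature": "T1", "Consequence": "synonymous_variant"},
    {"Feature": "T2", "Consequence": "stop_gained"},
    {"Feature": "T3", "Consequence": "missense_variant&splice_region_variant"},
]
for mode in ("first", "most-severe", "split"):
    print(mode, [t["Feature"] for t in select_transcripts(transcripts, mode)])
```

With `split`, one variant yields one row per transcript, so the output can have many more rows than the input has variants.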

📈 Performance

Benchmarks

  • Small files (< 1K variants): ~5,000 variants/sec
  • Medium files (1K-10K variants): ~15,000 variants/sec
  • Large files (10K+ variants): ~30,000 variants/sec

Optimization Tips

```shell script
# Auto-detect optimal thread count
vcf-reformatter input.vcf.gz -j 0

# For files > 10K variants, use parallel processing
vcf-reformatter input.vcf.gz -t most-severe -j 0 -v

# Combine with compression for large outputs
vcf-reformatter input.vcf.gz -t split -j 0 -c -v
```

📊 Output Format

File Structure

VCF Reformatter generates two files:

  • {prefix}_header.txt - Original VCF header and metadata
  • {prefix}_reformatted.tsv - Flattened tabular data
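The two-file split can be pictured as separating header lines from variant records. A minimal sketch, assuming every line that starts with `#` (both `##` metadata and the `#CHROM` column line) goes to the header file; `split_header` is a hypothetical helper, and whether the tool treats the `#CHROM` line this way is not stated here.

```python
def split_header(lines):
    """Separate VCF header lines (starting with "#") from variant records."""
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        (header if line.startswith("#") else records).append(line)
    return header, records


vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t69511\t.\tA\tG\t1294.53\t.\tDP=65;AF=1\n",
]
header, records = split_header(vcf_lines)
print(len(header), "header lines,", len(records), "record(s)")
```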

Column Types

  1. Standard VCF: CHROM, POS, ID, REF, ALT, QUAL, FILTER
  2. INFO Fields: INFO_DP, INFO_AF, INFO_AC, etc.
  3. VEP Annotations: CSQ_Allele, CSQ_Consequence, CSQ_SYMBOL, CSQ_Gene, etc.
  4. SnpEff Annotations: ANN_Allele, ANN_Annotation_Impact, ANN_Gene_Name, ANN_Distance, etc.
  5. Sample Data: SAMPLE1_GT, SAMPLE1_DP, SAMPLE1_AD, etc.
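The per-sample columns follow from pairing the VCF FORMAT keys with each sample's colon-separated values. A small sketch of that pairing, with `expand_sample` as a hypothetical helper; padding with `.` reflects the VCF rule that trailing sample fields may be omitted.

```python
def expand_sample(sample_name: str, format_field: str, sample_value: str) -> dict[str, str]:
    """Pair FORMAT keys with one sample's values, e.g. GT:DP:AD with 1/1:65:0,65."""
    keys = format_field.split(":")
    values = sample_value.split(":")
    # Trailing fields may be dropped in VCF; fill them with the missing marker.
    values += ["."] * (len(keys) - len(values))
    return {f"{sample_name}_{key}": value for key, value in zip(keys, values)}


print(expand_sample("SAMPLE1", "GT:DP:AD", "1/1:65:0,65"))
```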

Example Output VEP

    CHROM  POS    ID     REF  ALT  QUAL     FILTER  INFO_DP  CSQ_Consequence     CSQ_SYMBOL  SAMPLE1_GT
    chr1   69511  .      A    G    1294.53  PASS    65       missense_variant    OR4F5       1/1
    chr1   69761  rs123  C    T    892.15   PASS    42       synonymous_variant  OR4F5       0/1

Example Output SnpEff

    CHROM POS ID REF ALT QUAL FILTER INFO_DP ANN_Annotation ANN_Gene_Name SAMPLE1_GT
    chr1 69761 rs587 C T 730 PASS . 214 synonymous_variant OR4F5 0/1
    chr1 924024 . A G 53 PASS . 409 5_prime_UTR_variant SAMD11 1/1

🔧 Integration Examples

With R

```r
# Read compressed output directly (fread needs the R.utils package for .gz files)
library(data.table)
data <- fread("output_reformatted.tsv.gz")

# Quick variant summary: count variants per consequence
table(data$CSQ_Consequence)
```

With Python

```python
import pandas as pd

# Load and analyze
df = pd.read_csv("output_reformatted.tsv.gz", sep="\t", compression="gzip")
df['CSQ_Consequence'].value_counts()
```
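If pandas is not available, the same count can be done with the standard library. The sketch below is self-contained: it first writes a tiny gzip-compressed TSV (two rows borrowed from the VEP example output) so that it runs on its own; `count_consequences` is a hypothetical helper, and the file name assumes the `{prefix}_reformatted.tsv` pattern with gzip compression.

```python
import csv
import gzip
import os
import tempfile
from collections import Counter


def count_consequences(path: str, column: str = "CSQ_Consequence") -> Counter:
    """Count values of one column in a gzip-compressed, tab-separated file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row[column] for row in reader)


# Demo data so the example is runnable without real tool output.
rows = [
    ["CHROM", "POS", "CSQ_Consequence", "CSQ_SYMBOL"],
    ["chr1", "69511", "missense_variant", "OR4F5"],
    ["chr1", "69761", "synonymous_variant", "OR4F5"],
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output_reformatted.tsv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
    counts = count_consequences(path)

print(counts.most_common())
```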

In Workflows

```shell script
# Nextflow pipeline
vcf-reformatter ${vcf} -t most-severe -j ${task.cpus} -o results/ -c

# Snakemake rule
shell: "vcf-reformatter {input.vcf} -t most-severe -j {threads} -o {params.outdir} -c"
```
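For ad hoc batch scripts outside a workflow manager, the command line can also be assembled in Python. A sketch using only flags listed under Complete Options; `build_command` is a hypothetical wrapper, and it only builds the argument list rather than running the tool.

```python
import shlex


def build_command(vcf, transcript_handling="most-severe", threads=0,
                  output_dir=None, compress=False, output_format="tsv"):
    """Build a vcf-reformatter argument list from documented flags."""
    cmd = ["vcf-reformatter", vcf, "-t", transcript_handling, "-j", str(threads)]
    if output_format != "tsv":
        cmd += ["--output-format", output_format]
    if output_dir:
        cmd += ["-o", output_dir]
    if compress:
        cmd.append("-c")
    return cmd


cmd = build_command("cohort.vcf.gz", threads=8, output_dir="results/", compress=True)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided the binary is on `PATH`; keeping arguments as a list avoids shell-quoting problems with unusual file names.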

🐳 Container Usage

Docker

```shell script
# Build once
docker build -t vcf-reformatter .

# Run anywhere
docker run --rm \
  -v $(pwd):/data \
  vcf-reformatter \
  /data/input.vcf.gz \
  -t most-severe -j 4 -o /data/results/ -c
```

Singularity (HPC)

```shell script
# On HPC cluster
singularity run \
  --bind $PWD:/data \
  --bind /scratch:/scratch \
  vcf-reformatter.sif \
  /data/large_cohort.vcf.gz \
  -t most-severe -j 16 -o /scratch/results/ -c -v
```

🧪 Use Cases

| Use Case | Command | Why It Works |
|----------|---------|--------------|
| Clinical Variant Review | vcf-reformatter variants.vcf.gz -t most-severe | Prioritizes clinically relevant consequences |
| Population Analysis | vcf-reformatter cohort.vcf.gz -t first -j 0 -c | Fast processing of large cohorts |
| Transcript Studies | vcf-reformatter genes.vcf.gz -t split -v | Comprehensive transcript-level analysis |
| Quick Data Exploration | vcf-reformatter sample.vcf.gz | Simple, fast conversion for immediate analysis |
| HPC Batch Processing | vcf-reformatter huge.vcf.gz -t most-severe -j 32 -c | Optimized for high-performance computing |

🚀 What's New in v0.3.0

  • ✅ MAF Output Support (in Beta ⚠️) - Direct conversion to Mutation Annotation Format
  • ✅ Auto-metadata Detection (in Beta ⚠️) - Extracts center/sample info from VCF headers for MAF
  • ✅ Memory-Efficient Processing (streaming) - Chunked streaming for large files (>>100K variants)
  • ✅ Enhanced Error Handling - Better processing of malformed files
  • ✅ Comprehensive Testing - 70+ test cases ensure reliability

Previous Releases

🚀 What's New in v0.2.0

  • ✅ SnpEff Support - Full ANN field parsing with intelligent detection
  • ✅ Smart Auto-Detection - Automatically identifies VEP vs SnpEff annotations
  • ✅ Enhanced Error Handling - Better processing of malformed or headerless files

TODOs

  • ~~Add SnpEff support~~ ✅
  • ~~Output MAF format option~~ ✅
  • Add stdin to combine with other tools, such as bcftools
  • Support for multi-sample VCF files in MAF output

🤝 Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Add tests for new functionality
  4. Commit your changes: git commit -am 'Add feature'
  5. Push to the branch: git push origin feature-name
  6. Submit a pull request

Development Setup

    git clone https://github.com/flalom/vcf-reformatter.git
    cd vcf-reformatter
    cargo test                            # Run the test suite
    cargo run -- data/sample.vcf.gz -v    # Test with sample data

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • VCF Format Contributors - For the standard that enables genomic data sharing
  • VEP Team - For the powerful variant annotation framework
  • Rust Community - For the incredible ecosystem that makes this possible
  • Bioinformatics Community - For feedback and feature requests

Frequently Asked Questions

Q: Which transcript handling mode should I use?

  • Clinical analysis: --transcript-handling most-severe
  • Quick exploration: --transcript-handling first
  • Comprehensive analysis: --transcript-handling split

Q: How does this compare to other VCF tools?

VCF Reformatter is specifically designed for:

  • Converting complex VEP/SnpEff annotations to tabular format
  • Handling multiple transcripts intelligently
  • High-performance parallel processing
  • Easy integration with R/Python workflows

Q: Can I use this in production pipelines?

Yes! VCF Reformatter is designed for production use with:

  • Comprehensive error handling
  • Docker/Singularity support
  • Automated testing
  • Stable CLI interface

Q: What's the difference between TSV and MAF output?

  • TSV: Direct flattening of VCF fields (default)
  • MAF (beta): Standardized cancer genomics format for downstream tools

Q: What if I get out-of-memory errors?

  • Use TSV format instead of MAF: vcf-reformatter file.vcf.gz -j 0 -c
  • Enable verbose mode to monitor: vcf-reformatter file.vcf.gz -v

📞 Support


**⭐ Star this repo if VCF Reformatter helps your research!**

Made with ❤️ by [Flavio Lombardo](https://github.com/flalom)

Owner

  • Name: Flavio Lombardo
  • Login: flalom
  • Kind: user
  • Location: Switzerland

Bioinformatics|Computational biology|Data science 🧬🔬🖥️🧮💊🤒

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use VCF Reformatter in your research, please cite it as below."
title: "VCF Reformatter: High-performance VCF file parser and reformatter with VEP and SnpEff annotation support"
version: "0.2.0"
date-released: "2025-07-23"
url: "https://github.com/flalom/vcf-reformatter"
repository-code: "https://github.com/flalom/vcf-reformatter"
doi: "10.5281/zenodo.16354810"
type: software
license: MIT

authors:
  - family-names: "Lombardo"
    given-names: "Flavio"
    orcid: "https://orcid.org/0000-0002-4853-6838"
    affiliation: "University Hospital Basel and University of Basel"
    email: "fl@flaviolombardo.site"

abstract: >
  VCF Reformatter is a high-performance Rust command-line tool for parsing and
  reformatting VCF (Variant Call Format) files, with comprehensive support for both
  VEP (Variant Effect Predictor) and SnpEff annotations. The tool flattens complex
  VCF files into tab-separated values (TSV) format for easier downstream analysis,
  featuring intelligent transcript handling, auto-detection of annotation types,
  and parallel processing for high-throughput genomic workflows.

keywords:
  - bioinformatics
  - genomics
  - VCF
  - variant calling
  - VEP annotations
  - SnpEff annotations
  - file format conversion
  - parallel processing
  - rust
  - computational biology
  - variant effect predictor

GitHub Events

Total
  • Create event: 3
  • Release event: 2
  • Issues event: 2
  • Watch event: 21
  • Delete event: 2
  • Issue comment event: 6
  • Push event: 5
  • Fork event: 2
Last Year
  • Create event: 3
  • Release event: 2
  • Issues event: 2
  • Watch event: 21
  • Delete event: 2
  • Issue comment event: 6
  • Push event: 5
  • Fork event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: 15 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 5.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: 15 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 5.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ksarathbabu (1)
  • Aljumiliy1 (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cargo 510 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
crates.io: vcf-reformatter

Fast VCF file parser and reformatter with VEP and SnpEff annotation support which can output to MAF

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 510 Total
Rankings
Dependent repos count: 20.8%
Stargazers count: 24.5%
Dependent packages count: 27.5%
Forks count: 30.5%
Average: 39.6%
Downloads: 94.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v4 composite
  • docker/build-push-action v5 composite
  • docker/setup-buildx-action v3 composite
  • dtolnay/rust-toolchain stable composite
.github/workflows/release.yml actions
  • actions/cache v3 composite
  • actions/checkout v4 composite
  • docker/build-push-action v5 composite
  • docker/metadata-action v5 composite
  • docker/setup-buildx-action v3 composite
  • dtolnay/rust-toolchain stable composite
  • softprops/action-gh-release v1 composite
Cargo.lock cargo
  • adler2 2.0.1
  • aho-corasick 1.1.3
  • anstream 0.6.19
  • anstyle 1.0.11
  • anstyle-parse 0.2.7
  • anstyle-query 1.1.3
  • anstyle-wincon 3.0.9
  • bitflags 2.9.1
  • cfg-if 1.0.1
  • clap 4.5.41
  • clap_builder 4.5.41
  • clap_derive 4.5.41
  • clap_lex 0.7.5
  • colorchoice 1.0.4
  • crc32fast 1.4.2
  • crossbeam-deque 0.8.6
  • crossbeam-epoch 0.9.18
  • crossbeam-utils 0.8.21
  • either 1.15.0
  • errno 0.3.13
  • fastrand 2.3.0
  • flate2 1.1.2
  • getrandom 0.3.3
  • heck 0.5.0
  • hermit-abi 0.5.2
  • is_terminal_polyfill 1.70.1
  • libc 0.2.174
  • linux-raw-sys 0.9.4
  • memchr 2.7.5
  • miniz_oxide 0.8.9
  • num_cpus 1.17.0
  • once_cell 1.21.3
  • once_cell_polyfill 1.70.1
  • proc-macro2 1.0.95
  • quote 1.0.40
  • r-efi 5.3.0
  • rayon 1.10.0
  • rayon-core 1.12.1
  • regex 1.11.1
  • regex-automata 0.4.9
  • regex-syntax 0.8.5
  • rustix 1.0.7
  • strsim 0.11.1
  • syn 2.0.104
  • tempfile 3.20.0
  • unicode-ident 1.0.18
  • utf8parse 0.2.2
  • wasi 0.14.2+wasi-0.2.4
  • windows-sys 0.59.0
  • windows-targets 0.52.6
  • windows_aarch64_gnullvm 0.52.6
  • windows_aarch64_msvc 0.52.6
  • windows_i686_gnu 0.52.6
  • windows_i686_gnullvm 0.52.6
  • windows_i686_msvc 0.52.6
  • windows_x86_64_gnu 0.52.6
  • windows_x86_64_gnullvm 0.52.6
  • windows_x86_64_msvc 0.52.6
  • wit-bindgen-rt 0.39.0
Cargo.toml cargo