vcfman
Pipeline for manipulating and processing of VCF files. Standardize outputs and downstream plotting for overview of relevant metrics and variant distribution.
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 6
- Releases: 0
Metadata Files
README.md
vcfMAN
Pipeline consists of scripts (R and bash) for processing Variant Call Format (VCF) files. The purpose of this pipeline is to standardize output and perform downstream plotting for overview of relevant variant metrics and variant distribution. Pipeline takes VCF files related to both Structural Variants (SV) and small variants (variants ≤ 50 bp). The main outputs are tables, BED formatted variant annotations and figures. In addition, tools for visualizing BED coverage are also available.
Instructions on how to execute
- Download the repo to your local machine with:
wget https://github.com/mattssca/vcfMAN/archive/refs/heads/main.zip
- Unpack with:
unzip -a main.zip
- Set the directory as the current directory:
cd vcfMAN-main/
- Install dependencies:
sh install_dep.sh
- Migrate VCF files (SVs/small variants) to the corresponding directory (the pipeline also accepts compressed VCF files in .gzip format).
- Execute the master script with:
sh vcf_man.sh
followed by a command-line argument for the input VCF type. Valid arguments are SVs, small_variants and both.
- Example for running the pipeline on a small variant VCF:
sh vcf_man.sh small_variants
- If a BED file (dipcall) is available, the user can also execute
sh bed_man.sh
to visualize the input. The input BED file is to be stored in in/BED. The script automatically adds the BED plot to the finalized report.
- Output files (figures, tables, summaries and reports) are generated and saved to the corresponding output folder (out/SVs or out/small_variants).
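The three valid arguments lend themselves to a simple dispatch. The sketch below is not the actual vcf_man.sh; it only illustrates, assuming the mode arrives as the first argument, how such input could be validated. The function name resolve_mode and the usage message are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of vcfMAN): map the user-supplied mode to the
# variant sets to process. Prints one set per line; fails on any other input.
resolve_mode() {
  case "$1" in
    SVs)            echo "SVs" ;;
    small_variants) echo "small_variants" ;;
    both)           printf '%s\n' "SVs" "small_variants" ;;
    *)              echo "usage: sh vcf_man.sh {SVs|small_variants|both}" >&2
                    return 1 ;;
  esac
}

# Example: list the variant sets that the argument "both" would trigger.
resolve_mode both
```

Rejecting unknown arguments up front keeps a mistyped mode from silently running nothing.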
Flowchart
Overview of associated processes and workflow described in vcfMAN.
- vcfMAN.sh (vcf_man.sh) acts as a master script and calls the appropriate scripts based on the user input (SVs, small variants or both).
- 01-gunzip.sh (unpacksmallvariants.sh and unpackSVs.sh) checks whether input VCFs are compressed with Gzip; if so, the VCF files are extracted from the compressed format.
- 02-read_vcf.R (readvcfsmallvariants.R and readvcfstructuralvaiants.R) performs the data wrangling needed to extract relevant information from the input files. This is done in several steps:
- List VCF files located in the /in folder. Strip the path and save the sample name as a variable.
- Read listed VCF files into the R environment, skipping header (header is being extracted and saved in /out folder).
- Subset input VCF on relevant variables.
- Format genotype field.
- Format INFO field to get the variant type (e.g. deletion, duplication, etc.).
- Rename variables (e.g. chromosome, start, end, genotype, etc.).
- Creates two new variables based on n nucleotides present in the alt and ref column.
- Compute SV length (alt nucleotide count - ref nucleotide count). Deletions are defined as negative sv_lenght values.
- Negative values are transformed into absolute values.
- Compute variant end-coordinates (start coordinate + sv_lenght).
- Genotype information is converted into character-string.
- End-coordinates for deleterious variants are transformed into start coordinates + 1.
- SV sub-types are defined (e.g. SIMPLEDEL = del, SUBSDEL = del, CONTRAC = del, DUP = dup, SIMPLEINS = ins).
- Variants not belonging to the specified input VCF are subset and removed from downstream analysis (i.e. for small variants, variants > 50 bp are removed, and for SV calls, variants ≤ 50 bp are removed). Non-hard-coded genotypes are subset and removed in the same way. Metrics related to removed variants are printed to the console and saved as individual .txt files, allowing for more in-depth interrogation of such variants.
- Data frame is sorted and exported as .tsv (main input for plotting).
- Summary metrics are generated and exported.
- Summary table is printed to the console and exported as .png.
- 03-plot.R (plot_smallvariants.R and plot_structuralvariants.R) are the main plotting scripts, using base R functions as well as third-party R packages (listed in Dependencies) to generate all associated plots.
- In order to construct circos plots (using the RCircos package), generated BED files are merged (overlapping regions) using bedtools (bedtools.sh). The merged BED files are read by the RCircos script (Rcircos.R) to produce circos plots. If any additional regions are to be plotted as individual tracks or intersecting regions, see the instructions in Rcircos.R.
- 04-img.sh (imgmansmallvariants.sh, imgmanstructuralvariants.sh and imgmancombine.sh) is called to format variant reports as PDF. The scripts combine tables and figures into a standardized report that can be used for interrogating call-set quality and variant distributions. Individual plots are also available in the out/fig folders.
- bedman.sh (bed_man.sh) is called to plot coverage of the input BED file. This is an additional function and is not called by default when executing vcf_man.sh. The script is called with the same parameters as the main vcf_man.sh script (with the input type specified as small_variants, SVs or both). The shell script executes an R script (readbeddip_variants.R) to perform data wrangling and plotting. An additional shell script is called to stitch the BED coverage plot to an existing report.
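The length-related steps listed under 02-read_vcf.R can be illustrated with a short sketch. The pipeline itself performs this work in R; the version below uses awk inside a shell function purely for illustration and is not the project's code. The function name vcf_lengths, the output column order and the labelling of zero-length records as snv are assumptions. Multi-allelic ALT fields and INFO-based sub-types (such as duplications) are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of vcfMAN): derive length, type, end coordinate
# and size group from VCF records read on stdin.
# Output columns (tab-separated): chr, start, end, length, type, group.
vcf_lengths() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }                          # skip VCF header lines
       {
         len = length($5) - length($4)        # alt count minus ref count
         type = (len < 0) ? "del" : ((len > 0) ? "ins" : "snv")
         if (len < 0) len = -len              # keep the absolute length
         group = (len > 50) ? "SV" : "small"  # 50 bp threshold from the text
         print $1, $2, $2 + len, len, type, group
       }'
}

# Example: one 3 bp deletion and one 2 bp insertion.
printf 'chr1\t100\t.\tACGT\tA\nchr1\t200\t.\tA\tACC\n' | vcf_lengths
```

The sign of the raw difference is read before taking the absolute value, since the sign is what distinguishes deletions from insertions.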

Example Output
Brief overview and comments on output figures and tables.
Structural Variants
SV Size Distribution
Violin plot visualizing size distributions of SVs (deletions, duplications and insertions). Variant sizes in log10 scale on y-axis and sub types of SVs (deletions, duplications and insertions) on x-axis. Black dot annotates mean variant size.

SV Distribution per Chromosome (stacked)
Stacked histogram depicting variant distributions per chromosome. The y-axis shows the number of variants (n) and chromosomes are arranged on the x-axis. Note that certain chromosomes (e.g. chr2, chr16 and chr19) typically show higher fractions of SVs than other chromosomes. This is typically related to complex and difficult-to-map regions (segmental duplications, homopolymers and long repetitive sequences) that have been shown to be enriched for SVs (Chaisson, Mark J P et al. “Multi-platform discovery of haplotype-resolved structural variation in human genomes.” Nature Communications vol. 10,1 1784. 16 Apr. 2019, doi:10.1038/s41467-018-08148-z).

Binned SV Sizes
Histogram with set bin sizes showing size distributions of SVs. Bin sizes can easily be configured to suit specific needs (lines 84 and 87 of plot_structuralvariants.R). The y-axis depicts the number (n) of variants residing in each bin, and bins are shown along the x-axis.
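As a rough illustration of binning, the sketch below counts variant lengths per bin with awk. It is not taken from plot_structuralvariants.R; the function name bin_sizes and the default bin edges (100, 1000 and 10000 bp) are assumptions chosen only for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of vcfMAN): count variant lengths per size bin.
# $1: space-separated upper bin edges in ascending order (assumed defaults).
# Reads one length per line on stdin; each bin includes its lower edge and
# excludes its upper edge.
bin_sizes() {
  awk -v edges="${1:-100 1000 10000}" '
    BEGIN { n = split(edges, e, " ") }
    {
      b = n + 1                                   # overflow bin by default
      for (i = 1; i <= n; i++)
        if (($1 + 0) < (e[i] + 0)) { b = i; break }
      count[b]++
    }
    END {
      lo = 0
      for (i = 1; i <= n; i++) {
        printf "%s-%s\t%d\n", lo, e[i], count[i]
        lo = e[i]
      }
      printf ">=%s\t%d\n", e[n], count[n + 1]
    }'
}

# Example: five SV lengths sorted into the default bins.
printf '%s\n' 60 75 250 5000 20000 | bin_sizes
```

Passing a different edge list as the first argument (for example "500 5000") changes the bins without touching the function.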

Circos Plot
Ideogram with cytogenetic bands visualized as a circos plot. Each variant sub-type is shown as its own track (colors follow the same pattern as for the violin and chromosome distribution plots). The user can also supply additional BED tracks to visualize any genomic feature of interest (e.g. specific genomic regions, medically relevant genes, etc.). To use this feature, the user needs to specify the additional BED tracks on lines 179 and 215 of plot_structuralvariants.R. For additional functions and features, please see the RCircos documentation.

Tables and Summaries
Summaries are exported as png files and located in out/SVs/fig. In addition to summary figures, tables are also generated. These include:
- VCF header (previously exported).
- Non-structural variants: text file annotating all variants ≤ 50 bp.
- Spreadsheet (.xlsx) with different pages for each metric and subset
- BED formatted txt file with chr | start | end | length | sv type | sv family | genotype
Small Variants
Small variants Size Distribution
Violin plot visualizing size distributions of small variants (e.g. deletions and duplications) with SNVs excluded. Variant sizes in log10 scale on y-axis and small variant sub-type on x-axis. Black dot annotates mean variant size.

SNV Distribution per Chromosome
Histogram showing the number of SNVs sorted on chromosome with chromosomes on the y-axis and number of SNVs on the x-axis.

SNV Distance
Plot showing the distribution of distances between neighboring SNVs. Distances above the third quartile are excluded to compensate for variants that lie exceptionally far apart (e.g. variants on opposite sides of the same centromere). The mean SNV distance for each chromosome is shown as a black line inside each box. The metric can be used to assess the breadth (coverage) of called SNVs against an expected rate (on average one SNV every 1000 nucleotides). The plot also shows whether any chromosome has more or fewer SNVs than the other chromosomes (i.e. whether variants are evenly distributed).
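The distance metric itself is simple to compute. The following sketch, which is not the pipeline's R code, derives distances between neighboring SNVs per chromosome from tab-separated chromosome and position columns. The function name snv_distances and the two-column input layout are assumptions, and the quartile-based filtering is left out.

```shell
#!/bin/sh
# Hypothetical helper (not part of vcfMAN): distance between neighboring SNVs.
# Reads chr<TAB>position on stdin and writes chr<TAB>distance for every pair
# of adjacent positions on the same chromosome.
snv_distances() {
  sort -k1,1 -k2,2n |
    awk 'BEGIN { FS = OFS = "\t" }
         $1 == prev_chr { print $1, $2 - prev_pos }   # same chromosome as before
         { prev_chr = $1; prev_pos = $2 }'
}

# Example: three SNVs on chr1 and two on chr2.
printf 'chr1\t100\nchr1\t350\nchr1\t1350\nchr2\t40\nchr2\t90\n' | snv_distances
```

No distance is emitted across chromosome boundaries, so the first SNV on each chromosome contributes only as a left neighbor.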
SNV Ideogram
Horizontally aligned ideogram highlighting SNVs in a genomic context. The plot makes use of chromosome lengths located in the dep/ folder. Currently all coordinates refer to GRCh38, and excluded regions (centromeres) refer to the same build. The tables can be customized to accommodate other versions of the reference genome, and blacklisted regions can be added to exclude further genomic regions.

BED Coverage plot
For each chromosome, the plot shows the number of bases included in the BED file divided by the total number of bases for that chromosome, expressed as a percentage. It also shows the total number of bases in the BED file as a fraction of the complete genome (GRCh38).
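The coverage calculation can be sketched as follows. This is an illustration rather than the code in bed_man.sh; the function name bed_coverage, the two-column layout of the chromosome length table and the file names in the example are assumptions. The sketch also assumes the BED intervals do not overlap (for instance after merging), since overlapping intervals would be counted twice.

```shell
#!/bin/sh
# Hypothetical helper (not part of vcfMAN): percentage of each chromosome
# covered by a BED file, plus the genome-wide percentage.
# $1: chromosome lengths (chr<TAB>length); $2: BED file (chr<TAB>start<TAB>end).
bed_coverage() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR { len[$1] = $2; total += $2; next }   # first file: lengths
       ($1 in len) { cov[$1] += $3 - $2; covered += $3 - $2 }
       END {
         for (c in len) printf "%s\t%.2f\n", c, 100 * cov[c] / len[c]
         printf "genome\t%.2f\n", 100 * covered / total
       }' "$1" "$2" | sort
}

# Example with two toy chromosomes written to a temporary directory.
tmp=$(mktemp -d)
printf 'chr1\t1000\nchr2\t500\n' > "$tmp/lengths.tsv"
printf 'chr1\t0\t250\nchr1\t500\t750\nchr2\t100\t150\n' > "$tmp/regions.bed"
bed_coverage "$tmp/lengths.tsv" "$tmp/regions.bed"
rm -r "$tmp"
```

BED intervals are half-open, so end minus start gives the number of covered bases directly.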

Tables and Summaries
Summaries are exported as png files and located in out/small_variants/fig. In addition to summary figures, tables are also generated. These include:
- VCF header (previously exported).
- Non-small variants: text file annotating all variants > 50 bp.
- Non-hardcoded genotypes (i.e. 1|2, 2|1): text file.
- Spreadsheet (.xlsx) with different pages for each metric and subset
- BED formatted txt file with chr | start | end | length | sv type | genotype
Dependencies
The pipeline is designed to work on MacOSX systems. Disclaimer: the pipeline has not been tested on Windows or Linux systems. To install all dependencies, execute install_dep.sh.
| Package | Environment | Version |
| ------- | ----------- | ------- |
| Brew | MacOSX | 3.2.0 |
| Bedtools | MacOSX | 2.30.0 |
| wget | MacOSX | 1.21.2 |
| imagemagick | C | 7.1.0 |
| PhantomJS | C | 2.1.1 |
| Webshot | R | 0.5.1 |
| stringr | R | 1.4.0 |
| table1 | R | 1.4.2 |
| dplyr | R | 2.1.1 |
| knitr | R | 1.3.4 |
| devtools | R | 2.4.2 |
| gridExtra | R | 2.3 |
| ggthemr | R | 1.1.0 |
| BiocManager | R | 1.30.16 |
| karyoploteR | R | 1.18.0 |
| openxlsx | R | 4.2.4 |
| RCircos | R | 1.2.1 |
| psych | R | 2.1.6 |
| data.table | R | 1.14.0 |
Owner
- Name: Carl-Adam Mattsson
- Login: mattssca
- Kind: user
- Location: Vancouver, BC
- Twitter: mattssca
- Repositories: 1
- Profile: https://github.com/mattssca
Research Programmer, Ryan Morin Lab, Canada’s Michael Smith Genome Sciences Centre, BC Cancer Research
Citation (CITATION.cff)
cff-version: 1.2.0
title: vcfMAN
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Carl Adam
    name-particle: Carl
    family-names: Mattsson
    email: mattsada@gmail.com
    orcid: 'https://orcid.org/0000-0002-6318-7912'
repository-code: 'https://github.com/mattssca/vcfMAN'
url: 'https://github.com/mattssca/vcfMAN/blob/main/README.md'
abstract: >-
  Pipeline consists of scripts (R and bash) for processing
  Variant Call Format (VCF) files. The purpose of this
  pipeline is to standardize output and perform downstream
  plotting for an overview of relevant variant metrics and
  variant distribution.
keywords:
  - vcf
  - Bioinformatics
  - Variant Caller
  - Vizualization
license: MIT
commit: fe7f48fc193e0ab4df67748846a8dfcaff6f8526
version: '1'
date-released: '2021-11-12'