cnvnator
a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads
Repository
Basic Info
- Host: GitHub
- Owner: abyzovlab
- License: other
- Language: C++
- Default Branch: master
- Size: 70.2 MB
Statistics
- Stars: 224
- Watchers: 19
- Forks: 68
- Open Issues: 37
- Releases: 0
Metadata Files
README.md
README
Quick start guide
```
Extract read mapping
$ ./cnvnator -root file.root -tree file.bam -chrom 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \
    17 18 19 20 21 22 X Y
OR
$ ./cnvnator -root file.root -tree file.bam -chrom $(seq 1 22) X Y
OR
$ ./cnvnator -root file.root -tree file.bam -chrom chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 \
    chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
OR
$ ./cnvnator -root file.root -tree file.bam -chrom $(seq -f 'chr%g' 1 22) chrX chrY
# If option -chrom is not used, all chromosomes from the bam file will be extracted.

Generate histogram
$ ./cnvnator -root file.root -his 1000 -d dir_with_genome_fa/
OR
$ ./cnvnator -root file.root -his 1000 -fasta file_genome.fa.gz
OR
$ ./cnvnator -root file.root -his 1000 -chrom 1 2 3 4 -fasta file_genome.fa.gz

Calculate statistics
$ ./cnvnator -root file.root -stat 1000

Partition
$ ./cnvnator -root file.root -partition 1000

Call CNVs
$ ./cnvnator -root file.root -call 1000

Import SNP data
$ ./cnvnator -root file.root -vcf file.vcf.gz
OR
$ ./cnvnator -root file.root -vcf file.vcf.gz -addchr
# Options -addchr or -rmchr can be used to add or remove the "chr" prefix from
# chromosome names in the vcf file to match chromosome names from the bam file.

Import mask data
$ ./cnvnator -root file.root -mask mask.fa.gz
OR
$ ./cnvnator -root file.root -mask mask.fa.gz -addchr

Generate SNP histograms
$ ./cnvnator -root file.root -baf 10000

Plotting
$ ./cnvnator -root file.root -view 10000
1:1M-50M
1:1M-50M baf

List root file content
$ ./cnvnator -root file.root -ls

Copy RD and SNP data to new root file
$ ./cnvnator -root file.root -cptrees new_file.root

Plotting whole-genome circular RD and BAF plots using the python tool:
$ ./plotcircular.py file.root
```
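The read-depth steps of the quick start always run in the same order, so they can be assembled programmatically. The following is a minimal sketch, not part of CNVnator: the helper name `build_pipeline` is hypothetical, and it only uses options shown in the quick start (`-tree`, `-his`, `-stat`, `-partition`, `-call`, `-chrom`, `-fasta`).

```python
import shlex


def build_pipeline(root_file, bam_file, bin_size, fasta=None, chroms=None):
    """Return the five read-depth commands as argv lists, in run order.

    Hypothetical helper: it only assembles the command lines shown in the
    quick start guide and does not run anything.
    """
    base = ["./cnvnator", "-root", root_file]
    chrom_args = ["-chrom", *chroms] if chroms else []
    his = base + ["-his", str(bin_size)] + chrom_args
    if fasta:
        his += ["-fasta", fasta]
    return [
        base + ["-tree", bam_file] + chrom_args,   # extract read mapping
        his,                                        # generate histogram
        base + ["-stat", str(bin_size)],            # calculate statistics
        base + ["-partition", str(bin_size)],       # partition RD signal
        base + ["-call", str(bin_size)],            # call CNVs
    ]


commands = build_pipeline("file.root", "file.bam", 1000, fasta="file_genome.fa.gz")
for argv in commands:
    print(shlex.join(argv))
```

Each argv list can then be passed to `subprocess.run(argv, check=True)` so that a failing step stops the run before the next one starts.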
1. Compilation
Dependencies
You must install ROOT package and set up $ROOTSYS variable (see ROOT documentation here).
Also, a link to the samtools binary should be present in your CNVnator directory together with compiled libhts.a HTSlib library in a htslib* subdirectory.
If samtools compilation does not complete but the file libbam.a has been created, you can continue.
Installation from release zip file (recommended)
See INSTALL for complete details.
Installation from github
```
git clone https://github.com/abyzovlab/CNVnator.git
cd CNVnator
ln -s /path/to/src/samtools samtools
make
```
If make doesn't work, try "make OMP=no" which will disable parallel support.
Installing with Yeppp support
Yeppp is a library which provides high-performance implementations of math functions.
To install with Yeppp support, download Yeppp from here
and extract it to a location of your choice. Set YEPPPLIBDIR and YEPPPINCLUDEDIR directories appropriately.
Typically, for Linux-based systems on x86-64, YEPPPLIBDIR will be yeppp-1.0.0/binaries/linux/x86_64/ and YEPPPINCLUDEDIR will be
yeppp-1.0.0/library/headers.
To build, type
make YEPPPLIBDIR=... YEPPPINCLUDEDIR=...
To disable OpenMP, add OMP=no to the make command.
2. Predicting CNV regions
Running CNVnator involves a few steps outlined below. Chromosome names and lengths are parsed from the input sam/bam file header.
2.1 EXTRACTING READ MAPPING FROM BAM/SAM FILES
$ ./cnvnator -root out.root [-chrom name1 ...] -tree [file1.bam ...] [-lite]
where,
-root out.root -- specifies output ROOT file. See ROOT package documentation.
-chrom name1 ... -- specifies chromosome name(s).
-tree file1.bam ... -- specifies bam file(s) names.
-lite -- use this option to produce a "lighter" (smaller) root file.
Chromosome names must be specified exactly as they appear in the sam/bam header, e.g., chrX or X. Multiple chromosomes can be specified, separated by spaces. If no chromosome is specified, read mapping is extracted for all chromosomes in the sam/bam file. Note that this requires a machine with at least 7 GB of physical memory; extracting read mapping for subsets of chromosomes is a way around this issue. Also note that an existing root file is not overwritten: successive runs add to it.
Example:
./cnvnator -root NA12878.root -chrom 1 2 3 -tree NA12878_ali.bam
for bam files with a header like this:
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
...
or
./cnvnator -root NA12878.root -chrom chr1 chr2 chr3 -tree NA12878_ali.bam
for bam files with a header like this:
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
...
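Because the -chrom names must match the header exactly, it can help to list the @SQ names before running the extraction. This is a minimal sketch, assuming the header text has already been obtained (for example with `samtools view -H`); the function name `chromosomes_from_header` is hypothetical.

```python
def chromosomes_from_header(header_text):
    """Map each @SQ sequence name (SN) to its length (LN), in header order."""
    lengths = {}
    for line in header_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "@SQ":
            continue
        # Each remaining field is a TAG:VALUE pair such as SN:chr1.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if "SN" in tags and "LN" in tags:
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


header = """@HD VN:1.4 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430"""

names = list(chromosomes_from_header(header))
print(names)
```

The resulting names are the ones to pass after -chrom.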
Example:
./cnvnator -root NA12878.root -chrom 4 5 6 -tree NA12878_ali.bam
./cnvnator -root NA12878.root -chrom 7 8 9 -tree NA12878_ali.bam
is equivalent to
./cnvnator -root NA12878.root -chrom 4 5 6 7 8 9 -tree NA12878_ali.bam
2.2 GENERATING A READ DEPTH HISTOGRAM
$ ./cnvnator -root file.root [-chrom name1 ...] -his bin_size [-d dir]
This step is not memory consuming and so can be done for all chromosomes at once. It can also be carried out for a subset of chromosomes. Files with individual chromosome sequences (.fa) are required and should reside in the current directory or in the directory specified by the -d option. Files should be named as: chr1.fa, chr2.fa, etc.
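If the reference is only available as one multi-sequence FASTA file, per-chromosome files for the -d option can be produced with a short script. This is a sketch under the assumption that the reference is uncompressed and that each record name (the first word after `>`) is the desired file name; `split_fasta` is a hypothetical helper, not part of CNVnator.

```python
import os


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to <out_dir>/<record name>.fa.

    Returns the list of files written, in the order the records appear.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as src:
        for line in src:
            if line.startswith(">"):
                if out:
                    out.close()
                # The record name is the first word after '>'.
                name = line[1:].split()[0]
                path = os.path.join(out_dir, name + ".fa")
                out = open(path, "w")
                written.append(path)
            if out:
                out.write(line)
    if out:
        out.close()
    return written
```

A record named chr1 ends up in chr1.fa, so the output only follows the chr1.fa, chr2.fa naming if the records themselves carry those names.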
2.3 CALCULATING STATISTICS
$ ./cnvnator -root file.root [-chrom name1 ...] -stat bin_size
This step must be completed before proceeding to partitioning and CNV calling.
2.4 RD SIGNAL PARTITIONING
$ ./cnvnator -root file.root [-chrom name1 ...] -partition bin_size [-ngc]
Option -ngc specifies not to use GC corrected RD signal. Partitioning is the most time consuming step.
2.5 CNV CALLING
$ ./cnvnator -root file.root [-chrom name1 ...] -call bin_size [-ngc]
Calls are printed to STDOUT by default. You may redirect them to a file using the redirect operator >
The output columns are as follows:
CNV_type coordinates CNV_size normalized_RD e-val1 e-val2 e-val3 e-val4 q0
where,
normalized_RD -- read depth normalized to 1.
e-val1 -- is calculated using t-test statistics.
e-val2 -- is from the probability of RD values within the region to be in
the tails of a gaussian distribution describing frequencies of RD values in bins.
e-val3 -- same as e-val1 but for the middle of CNV
e-val4 -- same as e-val2 but for the middle of CNV
q0 -- fraction of reads mapped with q0 quality
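The nine columns above map naturally onto a small record type. The sketch below assumes whitespace-separated columns and coordinates written as chrom:start-end; the example line, including the CNV type label and all numeric values, is made up for illustration and is not real CNVnator output.

```python
from typing import NamedTuple


class CnvCall(NamedTuple):
    cnv_type: str
    chrom: str
    start: int
    end: int
    size: int
    normalized_rd: float
    e_val1: float
    e_val2: float
    e_val3: float
    e_val4: float
    q0: float


def parse_call(line):
    """Parse one call line into a CnvCall (hypothetical helper)."""
    fields = line.split()
    cnv_type, coords = fields[0], fields[1]
    # Split on the last ':' so chromosome names containing ':' still work.
    chrom, span = coords.rsplit(":", 1)
    start, end = span.split("-")
    size = int(float(fields[2]))
    numbers = [float(x) for x in fields[3:9]]
    return CnvCall(cnv_type, chrom, int(start), int(end), size, *numbers)


# Made-up example line, for illustration only.
example = "deletion\t12:11396601-11436500\t39900\t0.5\t1e-10\t2e-20\t3e-10\t4e-20\t0.01"
call = parse_call(example)
print(call.chrom, call.start, call.end, call.normalized_rd)
```

Calls parsed this way can be filtered on fields such as `q0` or `e_val1` before further analysis.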
2.6 REPORTING READ SUPPORT
To find and report read support for deletions and duplications by abnormal read pairs, use the -pe option as below:
./cnvnator -pe file1.bam ... -qual val(20) -over val(0.8) [-f file]
Once prompted, enter a genomic region and the CNV type, e.g.,
```
12:11396601-11436500 del
or
chr12:11396601-11436500 del
```
Please note that the bin size should be a multiple of 100 bases (e.g., 2500, 3700,…)
2.7 MERGING ROOT FILES
./cnvnator -root out.root [-chrom name1 ...] -merge file1.root ...
Merging can be used when combining read mappings extracted from multiple files.
Note: histogram generation, statistics calculation, signal partitioning, and
CNV calling should be completed/redone after merging.
3. Importing VCF data
To import variant data from a VCF file, use the following option:
./cnvnator -root file.root [-chrom name1 ...] [-rmchr | -addchr] -vcf file.vcf.gz
If chromosome names are not specified, data for all chromosomes from file.vcf.gz will be imported. If
you would like to add or remove the "chr" prefix from your chromosome names, use options -addchr or -rmchr respectively.
It is important that chromosome names in the vcf file and the SAM/BAM file match.
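Whether -addchr, -rmchr, or neither is needed follows from comparing the two sets of chromosome names. A minimal sketch, assuming the names have already been collected from the VCF and from the SAM/BAM header; `prefix_option` is a hypothetical helper.

```python
def prefix_option(vcf_names, bam_names):
    """Return "-addchr", "-rmchr", or None for the -vcf import step."""
    vcf, bam = set(vcf_names), set(bam_names)
    if vcf & bam:
        return None  # names already overlap, nothing to change
    if {"chr" + name for name in vcf} & bam:
        return "-addchr"
    if {name[3:] for name in vcf if name.startswith("chr")} & bam:
        return "-rmchr"
    raise ValueError("chromosome names cannot be matched by a chr prefix change")


print(prefix_option(["1", "2", "X"], ["chr1", "chr2", "chrX"]))
```

A `ValueError` here usually means the two files use different naming schemes altogether, which the prefix options cannot fix.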
To mark known SNPs from the SNP database:
./cnvnator -root file.root [-chrom name1 ...] [-rmchr | -addchr] -idvar databasefile.vcf.gz
On running the above line, each SNP will be associated with a binary flag which equals 1 if it's in the database.
To mark variants based on genome accessibility using mask file from the 1000 Genomes Project:
./cnvnator -root file.root [-chrom name1 ...] [-rmchr | -addchr] -mask maskfile.fa.gz
On running the above line, each SNP will be associated with a binary flag which equals 1 if it's in the P-region.
4. Genotyping genomic regions and visualization
For efficient genotype calculations, we recommend that you sort the list of regions by chromosomes.
./cnvnator -root file.root -genotype bin_size [-ngc]
Once prompted enter a genomic region, e.g.,
```
12:11396601-11436500
or
chr12:11396601-11436500
or
12 11396601 11436500
or
chr12 11396601 11436500
```
One can also perform instant visualization by adding the word 'view', e.g.,
```
12:11396601-11436500 view
or
chr12:11396601-11436500 view
or
12 11396601 11436500 view
or
chr12 11396601 11436500 view
```
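The region forms accepted at the prompt can be normalized with a few lines of code, which is useful when preparing or checking a region list. This is a sketch of one way to read them, not CNVnator's own parser; `parse_region` is a hypothetical name, and shorthand such as 1M is not handled.

```python
def parse_region(text):
    """Return (chrom, start, end, extra) for one prompt line.

    Accepts both "chrom:start-end" and "chrom start end"; any trailing
    words (for example "view") are returned in the list `extra`.
    """
    tokens = text.split()
    if ":" in tokens[0]:
        chrom, span = tokens[0].rsplit(":", 1)
        start, end = span.split("-")
        extra = tokens[1:]
    else:
        chrom, start, end = tokens[:3]
        extra = tokens[3:]
    return chrom, int(start), int(end), extra


print(parse_region("chr12 11396601 11436500 view"))
```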
Additional notes
For genotyping of multiple regions one can use input piping, e.g.,
./cnvnator -root NA12878.root -genotype 100 << EOF
12:11396601-11436500
22:20999401-21300400
exit
EOF
Another example:
awk '{ print $2 } END { print "exit" }' calls.cnvnator | ./cnvnator -root NA12878.root -genotype 100
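The same input can be prepared in Python. The sketch below mirrors the awk one-liner, taking the second column of each call line and appending exit, with the small difference that blank lines are skipped; `genotype_input` is a hypothetical helper.

```python
def genotype_input(call_lines):
    """Build the text to feed to `cnvnator -genotype` on standard input."""
    regions = [line.split()[1] for line in call_lines if line.strip()]
    return "\n".join(regions + ["exit"]) + "\n"


# Made-up call lines, for illustration only.
calls = [
    "deletion\t12:11396601-11436500\t39900",
    "duplication\t22:20999401-21300400\t301000",
]
text = genotype_input(calls)
print(text, end="")

# To run the genotyping step with this input:
# import subprocess
# subprocess.run(["./cnvnator", "-root", "NA12878.root", "-genotype", "100"],
#                input=text, text=True, check=True)
```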
4.1 Visualizing specified regions
./cnvnator -root file.root [-chrom name1 ...] -view bin_size [-ngc]
Once prompted, enter a genomic region, e.g.,
```
12:11396601-11436500
or
chr12:11396601-11436500
or
12 11396601 11436500
or
chr12 11396601 11436500
```
Additionally, one can specify the length of flanking regions (default is 10 kb) to be displayed as well, e.g.,
```
12:11396601-11436500 100000
or
chr12:11396601-11436500 100000
or
12 11396601 11436500 100000
or
chr12 11396601 11436500 100000
```
One can also perform instant genotyping by adding the word 'genotype', e.g.,
```
12:11396601-11436500 genotype
or
chr12:11396601-11436500 genotype
or
12 11396601 11436500 genotype
or
chr12 11396601 11436500 genotype
```
4.2 Plotting B-allele frequency (BAF)
To plot BAF data alongside RD, use the baf option in view mode:
```
./cnvnator -root file.root -view bin_size
1:1-200000000 baf
```
The resulting output plot has two panels. On the upper panel, the black line corresponds to the binned RD signal, green to segmentation, and red to calls. On the bottom panel, each dot corresponds to the BAF value of a SNP. Colors represent the following:
- black - homozygous (1/1 or 1|1) SNPs in P-region of the strict mask,
- grey - homozygous (1/1 or 1|1) SNPs out of P-region of the strict mask,
- blue - heterozygous (0/1 or 0|1) SNPs in P-region of the strict mask,
- cyan - heterozygous (0/1 or 0|1) SNPs out of P-region of the strict mask,
- red - heterozygous (1|0) SNPs in P-region of the strict mask,
- orange - heterozygous (1|0) SNPs out of P-region of the strict mask.
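The color legend above is a lookup on two properties of each SNP: its genotype string and whether it lies in the P-region of the strict mask. A minimal sketch of that mapping, written only to restate the legend; `snp_color` is a hypothetical name and this is not the plotting code itself.

```python
COLORS = {
    ("hom", True): "black",
    ("hom", False): "grey",
    ("het01", True): "blue",
    ("het01", False): "cyan",
    ("het10", True): "red",
    ("het10", False): "orange",
}


def snp_color(genotype, in_p_region):
    """Return the legend color for a VCF genotype string such as "0|1"."""
    if genotype in ("1/1", "1|1"):
        kind = "hom"
    elif genotype in ("0/1", "0|1"):
        kind = "het01"
    elif genotype == "1|0":
        kind = "het10"
    else:
        raise ValueError("genotype not covered by the legend: " + genotype)
    return COLORS[(kind, bool(in_p_region))]


print(snp_color("0|1", True), snp_color("1|0", False))
```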
plotbaf.py
Plot BAF data with python tool plotbaf.py (requires numpy, matplotlib installed):
./plotbaf.py [-h] [-bs BINSIZE] [-res RESOLUTION] [-o SAVE_FILE] [-t TITLE]
[-nomask] [-useid] root_file region
Required arguments:
- root_file: cnvnator root file name
- region: chromosomal coordinates in the format chr:start-end

Optional arguments:
- size of bins (default 100,000): -bs BINSIZE, --binsize BINSIZE
- likelihood function resolution (default 100): -res RESOLUTION, --resolution RESOLUTION
- save plot to file: -o SAVE_FILE, --save_file SAVE_FILE
- plot title: -t TITLE, --title TITLE
- do calculations without mask: -nomask
- do calculations using idvar filter: -useid
Output plot consists of four panels. Starting from the top one, they are:
- BAF value for heterozygous SNPs.
- Likelihood function. Light dots on the imagemap represent the most likely value of BAF at each bin.
- Red line represents a distance between maxima positions in likelihood function that is equivalent to twice the absolute difference between most likely BAF value and 0.5. Blue dots represent the ratio between the value of the likelihood function at 0.5 and its maximum value.
- Green dots and blue error-bars correspond to mean MAF and standard deviation per bin, respectively. Bin size is 100k base pairs.
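The two quantities in the third panel follow directly from the per-bin likelihood. For a likelihood that is symmetric around 0.5 with maxima at b and 1 - b, the distance between the maxima is |1 - 2b|, which is twice |b - 0.5|. The sketch below computes this and the ratio at 0.5 for one bin, assuming the likelihood is sampled on an evenly spaced BAF grid from 0 to 1 with an odd number of points; that grid layout and the name `likelihood_summary` are assumptions, not taken from plotbaf.py.

```python
def likelihood_summary(likelihood):
    """Summarize one bin's BAF likelihood.

    `likelihood` holds values on an evenly spaced grid from 0 to 1 with an
    odd number of points, so the middle element sits exactly at BAF = 0.5.
    Returns (most likely BAF, distance between maxima, ratio at 0.5).
    """
    n = len(likelihood)
    grid = [i / (n - 1) for i in range(n)]
    peak = max(range(n), key=lambda i: likelihood[i])
    distance = 2 * abs(grid[peak] - 0.5)
    ratio = likelihood[n // 2] / likelihood[peak]
    return grid[peak], distance, ratio


# Toy symmetric likelihood with maxima at BAF 0.25 and 0.75.
print(likelihood_summary([0.0, 1.0, 0.25, 1.0, 0.0]))
```

A distance near 0 together with a ratio near 1 indicates a bin where the most likely BAF is 0.5.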
plotrdbaf.py
Plot RD and BAF data with python tool plotrdbaf.py:
./plotrdbaf.py [-h] [-bs BINSIZE] [-rdbs RDBINSIZE] [-res RESOLUTION]
[-o SAVE_FILE] [-t TITLE] [-nomask] [-useid]
root_file region
Required arguments:
- root_file: cnvnator root file name
- region: chromosomal coordinates in the format chr:start-end

Optional arguments:
- size of bins (default 100,000): -bs BINSIZE, --binsize BINSIZE
- size of bins for RD signal (default 100,000): -rdbs RDBINSIZE, --rdbinsize RDBINSIZE
- likelihood function resolution (default 100): -res RESOLUTION, --resolution RESOLUTION
- save plot to file: -o SAVE_FILE, --save_file SAVE_FILE
- plot title: -t TITLE, --title TITLE
- do calculations without mask: -nomask
- do calculations using idvar filter: -useid
Output plot consists of five panels. Starting from the top one, they are:
- Read depth (RD) signal.
- BAF value for heterozygous SNPs.
- Likelihood function. Light dots on the imagemap represent the most likely value of BAF at each bin.
- Red line represents a distance between maxima positions in likelihood function that is equivalent to twice the absolute difference between most likely BAF value and 0.5. Blue dots represent the ratio between the value of the likelihood function at 0.5 and its maximum value.
- Green dots and blue error-bars correspond to mean MAF and standard deviation per bin, respectively. Bin size is 100k base pairs.
plotcircular.py
Plot RD and BAF data with python tool plotcircular.py:
./plotcircular.py [-h] [-chrom CHROMOSOMES] [-bs BINSIZE] [-o SAVE_FILE]
[-t TITLE] [-rdbs RDBINSIZE] [-pbs PLOTBINSIZE]
[-nomask] [-useid]
root_file
Required arguments:
- root_file: cnvnator root file name

Optional arguments:
- comma separated chromosome list: -chrom CHROMOSOMES, --chromosomes CHROMOSOMES
- plot bin size (default 1,000,000): -pbs PLOTBINSIZE, --plotbinsize PLOTBINSIZE
- size of bins (default 100,000): -bs BINSIZE, --binsize BINSIZE
- size of bins for RD signal (default 100,000): -rdbs RDBINSIZE, --rdbinsize RDBINSIZE
- save plot to file: -o SAVE_FILE, --save_file SAVE_FILE
- plot title: -t TITLE, --title TITLE
- do calculations without mask: -nomask
- do calculations using idvar filter: -useid
Output plot is circular. The inner plot represents the RD signal, while the outer plot represents the MAF (minor allele frequency) signal.
5. Exporting CNV calls as VCFs
In order to export your CNV calls as a VCF file, use the script cnvnator2VCF.pl as
cnvnator2VCF.pl -prefix study1 -reference GRCh37 sample1.cnvnator.out /path/to/individual/fasta_files
where,
-prefix specifies a prefix string to prepend to the ID field in your output VCF. For example, if you set -prefix to "study1", the resulting ID column will contain study1CNVnatordel1, study1CNVnatordel2, etc.
-reference is the name of the reference genome you used, e.g., GRCh37, hg19, etc.
file.calls (sample1.cnvnator.out in the command above) is your CNVnator output file with the CNV calls
genome_dir (/path/to/individual/fasta_files in the command above) is the directory containing your individual reference fasta files such as 1.fa, 2.fa etc. (or chr1.fa, chr2.fa etc.)
6. Python module: Read CNVnator data from root file
Use python module pytools.io to extract CNVnator data from root file.
import pytools.io
# Open the CNVnator root file; every call below uses this one object.
io = pytools.io.IO("file.root")
positions, rd = io.get_signal("1", 100000, "RD")
positions2, phased_baf = io.get_signal("1", 100000, "SNP baf", flag=pytools.io.FLAG_USEHAP | pytools.io.FLAG_USEMASK)
positions, ybins, likelihood = io.get_signal_2d("1", 100000, "SNP likelihood")
List of available signals:
- "RD"
- "RD unique"
- "RD raw"
- "RD partition"
- "RD call"
- "GC"
- "SNP count"
- "SNP baf"
- "SNP maf"
- "SNP likelihood"

List of available flags:
- RD signal: FLAG_GC_CORR
- RD signal: FLAG_AT_CORR
- SNP signal: FLAG_USEMASK
- SNP signal: FLAG_USEID
- SNP signal: FLAG_USEHAP
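The `flag=` argument in the example combines options with the bitwise OR operator, which suggests the flags are bit masks. The sketch below illustrates that mechanism with Python's `enum.IntFlag`; the class name and the numeric values are hypothetical and are not the values defined in pytools.io.

```python
from enum import IntFlag


class Flag(IntFlag):
    # Hypothetical bit values, chosen only for this illustration.
    USEMASK = 1
    USEID = 2
    USEHAP = 4


# Combining two options sets both bits in one integer.
combined = Flag.USEHAP | Flag.USEMASK

print(Flag.USEMASK in combined)
print(Flag.USEID in combined)
```

Because each option occupies its own bit, any subset of options can be passed in a single argument.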
Contact Us
Please send your comments and suggestions to abyzov.alexej@mayo.edu
Owner
- Name: Abyzov lab
- Login: abyzovlab
- Kind: organization
- Location: Rochester, MN
- Website: http://abyzovlab.org
- Repositories: 11
- Profile: https://github.com/abyzovlab
Software packages developed in Abyzov's lab
Citation (CITATION)
1. Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011 Jun;21(6):974-84. doi: 10.1101/gr.114876.110.
2. Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, et al. Mapping copy number variation by population-scale genome sequencing. Nature. 2011 Feb 3;470(7332):59-65. doi: 10.1038/nature09708.
3. Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015 Oct 1;526(7571):75-81. doi: 10.1038/nature15394.
GitHub Events
Total
- Issues event: 2
- Watch event: 16
- Issue comment event: 12
- Pull request event: 2
- Fork event: 3
Last Year
- Issues event: 2
- Watch event: 16
- Issue comment event: 12
- Pull request event: 2
- Fork event: 3
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Milovan Suvakov | s****v@g****m | 59 |
| abyzovlab | a****j@m****u | 23 |
| Marghoob Mohiyuddin | m****m@b****m | 10 |
| Abyzov | m****3@r****u | 10 |
| Abyzov | m****3@r****g | 4 |
| Daniel Nilsson | d****n@s****e | 4 |
| ShobanaSekar | s****2@a****u | 3 |
| Taejeong Bae | b****z@g****m | 3 |
| Indraniel Das | i****s@g****u | 2 |
| Alexej Abyzov of group gerstein | a****5@l****l | 2 |
| Abyzov | m****3@r****u | 2 |
| Jason Stajich | j****n@b****g | 2 |
| Abyzov | m****3@r****u | 2 |
| Damien Zammit | d****n@z****m | 1 |
| Suvakov.Milovan@mayo.edu | m****4@m****u | 1 |
| Mike Dacre | m****e@g****m | 1 |
| Joel Martin | J****n@l****v | 1 |
| Jason Stajich | j****h@u****u | 1 |
| arpanda | a****a@g****m | 1 |
| Mariusz Karpiarz | m****z@v****m | 1 |
| Nathan Weeks | 1****s | 1 |
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 103
- Total pull requests: 2
- Average time to close issues: 10 months
- Average time to close pull requests: 4 days
- Total issue authors: 88
- Total pull request authors: 2
- Average comments per issue: 3.55
- Average comments per pull request: 0.5
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 15
- Pull requests: 0
- Average time to close issues: about 1 month
- Average time to close pull requests: N/A
- Issue authors: 13
- Pull request authors: 0
- Average comments per issue: 2.8
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- chirrie (4)
- MaoSihong (3)
- jingydz (3)
- omG0-hub (2)
- kghimire09 (2)
- lgmgeo (2)
- neharajkumar (2)
- yasin-uzun (2)
- FerFrancis (2)
- gevro (2)
- assane-mbodj (2)
- mwaldron104 (1)
- QuanLG (1)
- BIjoy92 (1)
- AJ211 (1)
Pull Request Authors
- Lightoscope (1)
- joelmartin (1)
- arpanda (1)
Packages
- Total packages: 1
- Total downloads: unknown
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 1
spack.io: cnvnator
A tool for CNV discovery and genotyping from depth-of-coverage by mapped reads.
- Homepage: https://github.com/abyzovlab/CNVnator
- License: []
- Latest release: 0.3.3 (published about 4 years ago)