cnvnator

a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads

https://github.com/abyzovlab/cnvnator

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: abyzovlab
License: other
Language: C++
Default Branch: master
Size: 70.2 MB

Statistics

Stars: 224
Watchers: 19
Forks: 68
Open Issues: 37
Releases: 0

Created almost 12 years ago · Last pushed over 4 years ago

Metadata Files

Readme License Citation

README

Quick start guide

```
# Extract read mapping
$ ./cnvnator -root file.root -tree file.bam -chrom 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \
      17 18 19 20 21 22 X Y
# OR
$ ./cnvnator -root file.root -tree file.bam -chrom $(seq 1 22) X Y
# OR
$ ./cnvnator -root file.root -tree file.bam -chrom chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 \
      chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
# OR
$ ./cnvnator -root file.root -tree file.bam -chrom $(seq -f 'chr%g' 1 22) chrX chrY
# If option -chrom is not used, all chromosomes from the bam file will be extracted.

# Generate histogram
$ ./cnvnator -root file.root -his 1000 -d dir_with_genome_fa/
# OR
$ ./cnvnator -root file.root -his 1000 -fasta file_genome.fa.gz
# OR
$ ./cnvnator -root file.root -his 1000 -chrom 1 2 3 4 -fasta file_genome.fa.gz

# Calculate statistics
$ ./cnvnator -root file.root -stat 1000

# Partition
$ ./cnvnator -root file.root -partition 1000

# Call CNVs
$ ./cnvnator -root file.root -call 1000

# Import SNP data
$ ./cnvnator -root file.root -vcf file.vcf.gz
# OR
$ ./cnvnator -root file.root -vcf file.vcf.gz -addchr
# Options -addchr or -rmchr can be used to add or remove the "chr" prefix from
# chromosome names in the vcf file to match chromosome names from the bam file.

# Import mask data
$ ./cnvnator -root file.root -mask mask.fa.gz
# OR
$ ./cnvnator -root file.root -mask mask.fa.gz -addchr

# Generate SNP histograms
$ ./cnvnator -root file.root -baf 10000

# Plotting
$ ./cnvnator -root file.root -view 10000
# At the prompt, enter a region, optionally followed by baf:
1:1M-50M
1:1M-50M baf

# List root file content
$ ./cnvnator -root file.root -ls

# Copy RD and SNP data to new root file
$ ./cnvnator -root file.root -cptrees new_file.root

# Plot RD and BAF whole-genome circular plots using the python tool
$ ./plotcircular.py file.root
```

1. Compilation

Dependencies

You must install the ROOT package and set up the $ROOTSYS variable (see the ROOT documentation).

Also, a link to the samtools binary should be present in your CNVnator directory together with compiled libhts.a HTSlib library in a htslib* subdirectory.

If compilation is not completed but the file libbam.a has been created, you can continue.

Installation from release zip file (recommended)

See INSTALL for complete details.

Installation from github

```
git clone https://github.com/abyzovlab/CNVnator.git

cd CNVnator

ln -s /path/to/src/samtools samtools

make
```

If make doesn't work, try "make OMP=no" which will disable parallel support.

Installing with Yeppp support

Yeppp is a library which provides high-performance implementations of math functions.

To install with Yeppp support, download Yeppp and extract it to a location of your choice. Set the YEPPPLIBDIR and YEPPPINCLUDEDIR directories appropriately.

Typically, for Linux-based systems on x86-64, YEPPPLIBDIR will be yeppp-1.0.0/binaries/linux/x86_64/ and YEPPPINCLUDEDIR will be yeppp-1.0.0/library/headers.

To build, type
make YEPPPLIBDIR=... YEPPPINCLUDEDIR=...

To disable OpenMP, add OMP=no to the make command.

2. Predicting CNV regions

Running CNVnator involves a few steps outlined below. Chromosome names and lengths are parsed from the input sam/bam file header.

2.1 EXTRACTING READ MAPPING FROM BAM/SAM FILES

$ ./cnvnator -root out.root [-chrom name1 ...] -tree [file1.bam ...] [-lite]

where,

-root out.root -- specifies output ROOT file. See ROOT package documentation.
-chrom name1 ... -- specifies chromosome name(s).
-tree file1.bam ... -- specifies bam file(s) names.
-lite -- use this option to produce a "lighter" (smaller) root file.

Chromosome names must be specified exactly as they appear in the sam/bam header, e.g., chrX or X. Multiple chromosomes can be given, separated by spaces. If no chromosome is specified, read mapping is extracted for all chromosomes in the sam/bam file. Note that this requires a machine with at least 7 GB of physical memory; extracting read mapping for subsets of chromosomes is a way around this issue. Also note that an existing root file is not overwritten.

Example:

./cnvnator -root NA12878.root -chrom 1 2 3 -tree NA12878_ali.bam

for bam files with a header like this:
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
...

or

./cnvnator -root NA12878.root -chrom chr1 chr2 chr3 -tree NA12878_ali.bam

for bam files with a header like this:
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
...
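To check which naming scheme a file uses before choosing -chrom values, the @SQ lines can be inspected programmatically. The following is a minimal sketch, assuming the header text has already been obtained (for example from `samtools view -H`); the function name `parse_sq_lines` is hypothetical and not part of CNVnator.

```python
def parse_sq_lines(header_text):
    """Return (name, length) tuples from the @SQ lines of a SAM/BAM header."""
    chromosomes = []
    for line in header_text.splitlines():
        fields = line.split()  # works for tab- or space-separated headers
        if not fields or fields[0] != "@SQ":
            continue
        # Remaining fields look like TAG:value, e.g. SN:chr1 or LN:249250621.
        tags = dict(field.split(":", 1) for field in fields[1:] if ":" in field)
        if "SN" in tags and "LN" in tags:
            chromosomes.append((tags["SN"], int(tags["LN"])))
    return chromosomes


header = """@HD VN:1.4 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430"""

# The names printed here are the values to pass to -chrom.
print([name for name, _ in parse_sq_lines(header)])
```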

Example:

./cnvnator -root NA12878.root -chrom 4 5 6 -tree NA12878_ali.bam
./cnvnator -root NA12878.root -chrom 7 8 9 -tree NA12878_ali.bam

is equivalent to

./cnvnator -root NA12878.root -chrom 4 5 6 7 8 9 -tree NA12878_ali.bam
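Because repeated runs against the same root file are equivalent to one combined run, extraction can be split into batches to limit memory use. The sketch below only builds the command lines and does not run anything; the helper name `extraction_commands` and the batch size are illustrative assumptions, and only the -root, -chrom and -tree options described above are used.

```python
def extraction_commands(root_file, bam_file, chromosomes, batch_size):
    """Build one cnvnator extraction command per batch of chromosomes."""
    commands = []
    for start in range(0, len(chromosomes), batch_size):
        batch = chromosomes[start:start + batch_size]
        commands.append(
            ["./cnvnator", "-root", root_file, "-chrom", *batch, "-tree", bam_file]
        )
    return commands


commands = extraction_commands(
    "NA12878.root", "NA12878_ali.bam", ["4", "5", "6", "7", "8", "9"], batch_size=3
)
for command in commands:
    print(" ".join(command))
```

Each list can be passed as-is to `subprocess.run` if the commands should be executed one after another.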

2.2 GENERATING A READ DEPTH HISTOGRAM

$ ./cnvnator -root file.root [-chrom name1 ...] -his bin_size [-d dir]

This step is not memory-intensive and so can be done for all chromosomes at once. It can also be carried out for a subset of chromosomes. Files with individual chromosome sequences (.fa) are required and should reside in the current directory or in the directory specified by the -d option. Files should be named chr1.fa, chr2.fa, etc.

2.3 CALCULATING STATISTICS

$ ./cnvnator -root file.root [-chrom name1 ...] -stat bin_size

This step must be completed before proceeding to partitioning and CNV calling.

2.4 RD SIGNAL PARTITIONING

$ ./cnvnator -root file.root [-chrom name1 ...] -partition bin_size [-ngc]

Option -ngc specifies that the GC-corrected RD signal should not be used. Partitioning is the most time-consuming step.

2.5 CNV CALLING

$ ./cnvnator -root file.root [-chrom name1 ...] -call bin_size [-ngc]

Calls are printed to STDOUT by default. You may redirect them to a file using the redirect operator >

The output columns are as follows:

CNV_type coordinates CNV_size normalized_RD e-val1 e-val2 e-val3 e-val4 q0

where,

normalized_RD -- read depth normalized to 1.
e-val1 -- calculated using t-test statistics.
e-val2 -- derived from the probability of RD values within the region being in the tails of a Gaussian distribution describing frequencies of RD values in bins.
e-val3 -- same as e-val1 but for the middle of the CNV.
e-val4 -- same as e-val2 but for the middle of the CNV.
q0 -- fraction of reads mapped with q0 quality.
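For downstream filtering it can help to load each call into a small record. The following is a sketch under stated assumptions: columns are whitespace-separated in the order listed above, and the coordinates column has the form chrom:start-end. The class and function names are hypothetical, and the sample line contains made-up values for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CnvCall:
    cnv_type: str
    chrom: str
    start: int
    end: int
    size: int
    normalized_rd: float
    e_val1: float
    e_val2: float
    e_val3: float
    e_val4: float
    q0: float


def parse_call_line(line):
    """Parse one line of CNV call output into a CnvCall record."""
    fields = line.split()
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 columns, got {len(fields)}")
    # Assumed coordinate format: chrom:start-end
    chrom, span = fields[1].rsplit(":", 1)
    start, end = span.split("-")
    numbers = [float(value) for value in fields[3:9]]
    return CnvCall(fields[0], chrom, int(start), int(end), int(float(fields[2])), *numbers)


# Made-up example line, not real CNVnator output.
call = parse_call_line("deletion 1:100001-200000 100000 0.5 1e-10 2e-20 1e-9 3e-18 0.01")
print(call.chrom, call.start, call.end, call.normalized_rd)
```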

2.6 REPORTING READ SUPPORT

To find and report read support for deletions and duplications by abnormal read pairs, use the -pe option as below:

./cnvnator -pe file1.bam ... -qual val(20) -over val(0.8) [-f file]

Once prompted, enter a genomic region and the CNV type, e.g.,

```
12:11396601-11436500 del
or
chr12:11396601-11436500 del
```

Please note that the bin size should be a whole multiple of 100 bases (e.g., 2500, 3700, …).
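A script that drives CNVnator can check this constraint before launching a run. This is a minimal sketch; the function name `is_valid_bin_size` is hypothetical.

```python
def is_valid_bin_size(bin_size):
    """Return True if bin_size is a positive whole multiple of 100."""
    return isinstance(bin_size, int) and bin_size > 0 and bin_size % 100 == 0


for candidate in (2500, 3700, 1050):
    print(candidate, is_valid_bin_size(candidate))
```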

2.7 MERGING ROOT FILES

./cnvnator -root out.root [-chrom name1 ...] -merge file1.root ...

Merging can be used when combining read mappings extracted from multiple files.
Note: histogram generation, statistics calculation, signal partitioning, and CNV calling should be completed/redone after merging.
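The steps that need to be rerun follow the same order as in sections 2.2 to 2.5. The sketch below builds those four command lines for a given root file and bin size; the helper name `downstream_commands` is an assumption for illustration, and it only uses the -his, -d, -stat, -partition and -call options documented above.

```python
def downstream_commands(root_file, bin_size, fasta_dir=None):
    """Build the histogram, statistics, partition and call commands in order."""
    histogram = ["./cnvnator", "-root", root_file, "-his", str(bin_size)]
    if fasta_dir is not None:
        histogram += ["-d", fasta_dir]  # directory holding the per-chromosome .fa files
    steps = [histogram]
    for option in ("-stat", "-partition", "-call"):
        steps.append(["./cnvnator", "-root", root_file, option, str(bin_size)])
    return steps


steps = downstream_commands("out.root", 1000, fasta_dir="genome/")
for step in steps:
    print(" ".join(step))
```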

3. Importing VCF data

To import variant data from a VCF file, use the following option:

./cnvnator -root file.root [-chrom name1 ...] [-rmchr | -addchr] -vcf file.vcf.gz

If chromosome names are not specified, data for all chromosomes from file.vcf.gz will be imported. If you would like to add or remove the "chr" prefix from your chromosome names, use options -addchr or -rmchr respectively. It is important that chromosome names in the vcf file and the SAM/BAM file match.
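Which option is needed depends on how the two files name their chromosomes. The following sketch picks between -addchr and -rmchr from two lists of names; the function name is hypothetical, and treating a file as "prefixed" only when every name starts with "chr" is a simplifying assumption.

```python
def _all_prefixed(names):
    names = list(names)
    return bool(names) and all(name.startswith("chr") for name in names)


def choose_chr_option(bam_names, vcf_names):
    """Return "-addchr", "-rmchr", or None when the naming already matches."""
    bam_prefixed = _all_prefixed(bam_names)
    vcf_prefixed = _all_prefixed(vcf_names)
    if bam_prefixed and not vcf_prefixed:
        return "-addchr"  # VCF names need the "chr" prefix added
    if vcf_prefixed and not bam_prefixed:
        return "-rmchr"   # VCF names need the "chr" prefix removed
    return None


print(choose_chr_option(["chr1", "chr2"], ["1", "2"]))
```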

To mark known SNPs from the SNP database:

./cnvnator -root file.root [-chrom name1 ...] [-rmchr | -addchr] -idvar databasefile.vcf.gz

On running the above line, each SNP will be associated with a binary flag which equals 1 if it's in the database.

To mark variants based on genome accessibility using mask file from the 1000 Genomes Project:

./cnvnator -root file.root [-chrom name1 ...] [-rmchr | -addchr] -mask maskfile.fa.gz

On running the above line, each SNP will be associated with a binary flag which equals 1 if it's in the P-region.

4. Genotyping genomic regions and visualization

For efficient genotype calculations, we recommend that you sort the list of regions by chromosomes.

./cnvnator -root file.root -genotype bin_size [-ngc]

Once prompted, enter a genomic region, e.g.,

```
12:11396601-11436500
or
chr12:11396601-11436500
or
12 11396601 11436500
or
chr12 11396601 11436500
```

One can also perform instant visualization by adding the word 'view', e.g.,

```
12:11396601-11436500 view
or
chr12:11396601-11436500 view
or
12 11396601 11436500 view
or
chr12 11396601 11436500 view
```

Additional notes

For genotyping of multiple regions one can use input piping, e.g.,

./cnvnator -root NA12878.root -genotype 100 << EOF
12:11396601-11436500
22:20999401-21300400
exit
EOF

Another example:

awk '{ print $2 } END { print "exit" }' calls.cnvnator | ./cnvnator -root NA12878.root -genotype 100
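The same input can be prepared in Python when the calls are already being processed in a script. This is a sketch mirroring the awk example: it takes the second column of each call line and appends `exit`. The function name is hypothetical, the sample lines are made up, and unlike awk it skips lines with fewer than two columns.

```python
def genotype_input(call_lines):
    """Build the text to feed to 'cnvnator -genotype' on standard input."""
    regions = []
    for line in call_lines:
        fields = line.split()
        if len(fields) >= 2:
            regions.append(fields[1])  # second column holds the coordinates
    regions.append("exit")
    return "\n".join(regions) + "\n"


# Made-up call lines for illustration.
lines = [
    "deletion 12:11396601-11436500 39900 0.1",
    "duplication 22:20999401-21300400 301000 1.6",
]
text = genotype_input(lines)
print(text, end="")
```

The resulting string can be supplied through the `input=` argument of `subprocess.run` (with `text=True`) when launching the genotyping command.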

4.1 Visualizing specified regions

./cnvnator -root file.root [-chrom name1 ...] -view bin_size [-ngc]

Once prompted, enter a genomic region, e.g.,

```
12:11396601-11436500
or
chr12:11396601-11436500
or
12 11396601 11436500
or
chr12 11396601 11436500
```

Additionally, one can specify the length of flanking regions (default is 10 kb) to be displayed as well, e.g.,

```
12:11396601-11436500 100000
or
chr12:11396601-11436500 100000
or
12 11396601 11436500 100000
or
chr12 11396601 11436500 100000
```

One can also perform instant genotyping by adding the word 'genotype', e.g.,

```
12:11396601-11436500 genotype
or
chr12:11396601-11436500 genotype
or
12 11396601 11436500 genotype
or
chr12 11396601 11436500 genotype
```
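Scripts that generate or validate prompt input may want to accept the same region spellings. Below is a minimal sketch of a parser for the two forms shown above, with any trailing words (a flank length, `view`, `genotype`, `baf`) returned separately. It is not CNVnator's own parser, the function name is hypothetical, and shorthand such as 1M is not handled.

```python
import re

_COLON_FORM = re.compile(r"^(\S+):(\d+)-(\d+)$")


def parse_region(text):
    """Return (chrom, start, end, extras) for 'chr:start-end' or 'chr start end'."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty region")
    match = _COLON_FORM.match(tokens[0])
    if match:
        chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
        extras = tokens[1:]
    elif len(tokens) >= 3 and tokens[1].isdigit() and tokens[2].isdigit():
        chrom, start, end = tokens[0], int(tokens[1]), int(tokens[2])
        extras = tokens[3:]
    else:
        raise ValueError(f"unrecognized region: {text!r}")
    return chrom, start, end, extras


print(parse_region("chr12:11396601-11436500 genotype"))
print(parse_region("12 11396601 11436500 100000"))
```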

4.2 Plotting B-allele frequency (BAF)

To plot BAF data alongside RD, use the baf option in view mode:

```
./cnvnator -root file.root -view bin_size

1:1-200000000 baf
```

The resulting plot has two panels. In the upper panel, the black line corresponds to the binned RD signal, green to segmentation, and red to calls. In the bottom panel, each dot corresponds to the BAF value of a SNP. Colors represent the following:

black - homozygous (1/1 or 1|1) SNPs in P-region of the strict mask,
grey - homozygous (1/1 or 1|1) SNPs out of P-region of the strict mask,
blue - heterozygous (0/1 or 0|1) SNPs in P-region of the strict mask,
cyan - heterozygous (0/1 or 0|1) SNPs out of P-region of the strict mask,
red - heterozygous (1|0) SNPs in P-region of the strict mask,
orange - heterozygous (1|0) SNPs out of P-region of the strict mask.
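The legend above amounts to a lookup on genotype and mask status. The sketch below encodes it as a function, which may be handy when reproducing the colouring in a custom plot; the function name and argument types are assumptions and do not come from the CNVnator code.

```python
def baf_point_color(genotype, in_p_region):
    """Map a VCF genotype string and strict-mask status to the legend colour."""
    if genotype in ("1/1", "1|1"):
        return "black" if in_p_region else "grey"
    if genotype in ("0/1", "0|1"):
        return "blue" if in_p_region else "cyan"
    if genotype == "1|0":
        return "red" if in_p_region else "orange"
    raise ValueError(f"genotype not covered by the legend: {genotype!r}")


print(baf_point_color("0|1", True), baf_point_color("1|0", False))
```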

plotbaf.py

Plot BAF data with python tool plotbaf.py (requires numpy, matplotlib installed):

./plotbaf.py [-h] [-bs BINSIZE] [-res RESOLUTION] [-o SAVE_FILE] [-t TITLE] [-nomask] [-useid] root_file region

Required arguments:
* root_file: cnvnator root file name
* region: chromosomal coordinates in the format chr:start-end

Optional arguments:
* size of bins (default 100,000): -bs BINSIZE, --binsize BINSIZE
* likelihood function resolution (default 100): -res RESOLUTION, --resolution RESOLUTION
* save plot to file: -o SAVE_FILE, --save_file SAVE_FILE
* plot title: -t TITLE, --title TITLE
* do calculations without mask: -nomask
* do calculations using idvar filter: -useid

Output plot consists of four panels. Starting from the top one, they are:

BAF value for heterozygous SNPs.
Likelihood function. Light dots on the imagemap represent the most likely value of BAF at each bin.
Red line represents a distance between maxima positions in likelihood function that is equivalent to twice the absolute difference between most likely BAF value and 0.5. Blue dots represent the ratio between the value of the likelihood function at 0.5 and its maximum value.
Green dots and blue error-bars correspond to mean MAF and standard deviation per bin, respectively. Bin size is 100k base pairs.
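The quantities in the last two panels can be illustrated numerically. This is a sketch under assumptions not stated in the source: MAF is taken as min(BAF, 1 - BAF), bins are assigned by integer division of the position by the bin size, and the population standard deviation is used. The red-line distance follows the description above, twice the absolute difference between the most likely BAF and 0.5. Function names are hypothetical and this is not the plotbaf.py implementation.

```python
from statistics import mean, pstdev


def maxima_distance(most_likely_baf):
    """Distance between likelihood maxima: 2 * |BAF - 0.5|."""
    return 2.0 * abs(most_likely_baf - 0.5)


def maf_per_bin(positions, bafs, bin_size=100_000):
    """Return {bin_index: (mean MAF, standard deviation)} for the given SNPs."""
    bins = {}
    for position, baf in zip(positions, bafs):
        maf = min(baf, 1.0 - baf)  # assumed definition of MAF
        bins.setdefault(position // bin_size, []).append(maf)
    return {index: (mean(values), pstdev(values)) for index, values in sorted(bins.items())}


stats = maf_per_bin([10, 20, 150_000], [0.4, 0.8, 0.5])
print(stats)
print(maxima_distance(0.3))
```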

plotrdbaf.py

Plot RD and BAF data with python tool plotrdbaf.py:

./plotrdbaf.py [-h] [-bs BINSIZE] [-rdbs RDBINSIZE] [-res RESOLUTION] [-o SAVE_FILE] [-t TITLE] [-nomask] [-useid] root_file region

Required arguments:
* root_file: cnvnator root file name
* region: chromosomal coordinates in the format chr:start-end

Optional arguments:
* size of bins (default 100,000): -bs BINSIZE, --binsize BINSIZE
* size of bins for RD signal (default 100,000): -rdbs RDBINSIZE, --rdbinsize RDBINSIZE
* likelihood function resolution (default 100): -res RESOLUTION, --resolution RESOLUTION
* save plot to file: -o SAVE_FILE, --save_file SAVE_FILE
* plot title: -t TITLE, --title TITLE
* do calculations without mask: -nomask
* do calculations using idvar filter: -useid

Output plot consists of five panels. Starting from the top one, they are:

Read depth (RD) signal.
BAF value for heterozygous SNPs.
Likelihood function. Light dots on the imagemap represent the most likely value of BAF at each bin.
Red line represents a distance between maxima positions in likelihood function that is equivalent to twice the absolute difference between most likely BAF value and 0.5. Blue dots represent the ratio between the value of the likelihood function at 0.5 and its maximum value.
Green dots and blue error-bars correspond to mean MAF and standard deviation per bin, respectively. Bin size is 100k base pairs.

plotcircular.py

Plot RD and BAF data with python tool plotcircular.py:

./plotcircular.py [-h] [-chrom CHROMOSOMES] [-bs BINSIZE] [-o SAVE_FILE] [-t TITLE] [-rdbs RDBINSIZE] [-pbs PLOTBINSIZE] [-nomask] [-useid] root_file

Required arguments:
* root_file: cnvnator root file name

Optional arguments:
* comma-separated chromosome list: -chrom CHROMOSOMES, --chromosomes CHROMOSOMES
* plot bin size (default 1,000,000): -pbs PLOTBINSIZE, --plotbinsize PLOTBINSIZE
* size of bins (default 100,000): -bs BINSIZE, --binsize BINSIZE
* size of bins for RD signal (default 100,000): -rdbs RDBINSIZE, --rdbinsize RDBINSIZE
* save plot to file: -o SAVE_FILE, --save_file SAVE_FILE
* plot title: -t TITLE, --title TITLE
* do calculations without mask: -nomask
* do calculations using idvar filter: -useid

The output plot is circular. The inner plot represents the RD signal, while the outer plot represents the MAF (minor allele frequency) signal.

5. Exporting CNV calls as VCFs

To export your CNV calls as a VCF file, use the script cnvnator2VCF.pl as follows:

cnvnator2VCF.pl -prefix study1 -reference GRCh37 sample1.cnvnator.out /path/to/individual/fasta_files

where,

-prefix specifies a prefix string to prepend to the ID field in your output VCF. For example, if you set -prefix to "study1", the resulting ID column will be study1CNVnatordel1, study1CNVnatordel2, etc.

-reference is the name of the reference genome you used, e.g., GRCh37, hg19, etc.

file.calls (sample1.cnvnator.out in the example above) is your CNVnator output file with the CNV calls.

genome_dir (/path/to/individual/fasta_files in the example above) is the directory containing your individual reference fasta files, such as 1.fa, 2.fa, etc. (or chr1.fa, chr2.fa, etc.).

6. Python module: Read CNVnator data from root file

Use python module pytools.io to extract CNVnator data from root file.

import pytools.io
io = pytools.io.IO("file.root")
positions, rd = io.get_signal("1", 100000, "RD")
positions2, phased_baf = io.get_signal("1", 100000, "SNP baf", flag=pytools.io.FLAG_USEHAP | pytools.io.FLAG_USEMASK)
positions, ybins, likelihood = io.get_signal_2d("1", 100000, "SNP likelihood")

List of available signals:
* "RD"
* "RD unique"
* "RD raw"
* "RD partition"
* "RD call"
* "GC"
* "SNP count"
* "SNP baf"
* "SNP maf"
* "SNP likelihood"

List of available flags:
* RD signal: FLAG_GC_CORR
* RD signal: FLAG_AT_CORR
* SNP signal: FLAG_USEMASK
* SNP signal: FLAG_USEID
* SNP signal: FLAG_USEHAP

Contact Us

Please send your comments and suggestions to abyzov.alexej@mayo.edu

Owner

Name: Abyzov lab
Login: abyzovlab
Kind: organization
Location: Rochester, MN

Website: http://abyzovlab.org
Repositories: 11
Profile: https://github.com/abyzovlab

Software packages developed in Abyzov's lab

Citation (CITATION)

1. Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011 Jun;21(6):974-84. doi: 10.1101/gr.114876.110.

2. Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, et al. Mapping copy number variation by population-scale genome sequencing. Nature. 2011 Feb 3;470(7332):59-65. doi: 10.1038/nature09708.

3. Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015 Oct 1;526(7571):75-81. doi: 10.1038/nature15394.

GitHub Events

Total

Issues event: 2
Watch event: 16
Issue comment event: 12
Pull request event: 2
Fork event: 3

Last Year

Issues event: 2
Watch event: 16
Issue comment event: 12
Pull request event: 2
Fork event: 3

Committers

Last synced: over 2 years ago

All Time

Total Commits: 134
Total Committers: 21
Avg Commits per committer: 6.381
Development Distribution Score (DDS): 0.56

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Milovan Suvakov	s**v@g**m	59
abyzovlab	a**j@m**u	23
Marghoob Mohiyuddin	m**m@b**m	10
Abyzov	m**3@r**u	10
Abyzov	m**3@r**g	4
Daniel Nilsson	d**n@s**e	4
ShobanaSekar	s**2@a**u	3
Taejeong Bae	b**z@g**m	3
Indraniel Das	i**s@g**u	2
Alexej Abyzov of group gerstein	a**5@l**l	2
Abyzov	m**3@r**u	2
Jason Stajich	j**n@b**g	2
Abyzov	m**3@r**u	2
Damien Zammit	d**n@z**m	1
Suvakov.Milovan@mayo.edu	m**4@m**u	1
Mike Dacre	m**e@g**m	1
Joel Martin	J**n@l**v	1
Jason Stajich	j**h@u**u	1
arpanda	a**a@g**m	1
Mariusz Karpiarz	m**z@v**m	1
Nathan Weeks	1****s	1

Committer Domains (Top 20 + Academic)

vscaler.com: 1 ucr.edu: 1 lbl.gov: 1 mforgehn1.mayo.edu: 1 zamaudio.com: 1 r5021059.mayo.edu: 1 bioperl.org: 1 r5043868.mayo.edu: 1 genome.wustl.edu: 1 asu.edu: 1 scilifelab.se: 1 r5144108.mfad.mfroot.org: 1 r5144108.mayo.edu: 1 binatechnologies.com: 1 mayo.edu: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 103
Total pull requests: 2
Average time to close issues: 10 months
Average time to close pull requests: 4 days
Total issue authors: 88
Total pull request authors: 2
Average comments per issue: 3.55
Average comments per pull request: 0.5
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 15
Pull requests: 0
Average time to close issues: about 1 month
Average time to close pull requests: N/A
Issue authors: 13
Pull request authors: 0
Average comments per issue: 2.8
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

chirrie (4)
MaoSihong (3)
jingydz (3)
omG0-hub (2)
kghimire09 (2)
lgmgeo (2)
neharajkumar (2)
yasin-uzun (2)
FerFrancis (2)
gevro (2)
assane-mbodj (2)
mwaldron104 (1)
QuanLG (1)
BIjoy92 (1)
AJ211 (1)

Pull Request Authors

Lightoscope (1)
joelmartin (1)
arpanda (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads: unknown

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 1

spack.io: cnvnator

A tool for CNV discovery and genotyping from depth-of-coverage by mapped reads.

Homepage: https://github.com/abyzovlab/CNVnator
License: []
Latest release: 0.3.3
published about 4 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent repos count: 0.0%

Forks count: 12.6%

Stargazers count: 14.8%

Average: 21.2%

Dependent packages count: 57.3%

Last synced: 10 months ago

cnvnator

Science Score: 28.0%
