Recent Releases of py-cnvkit

py-cnvkit - Version 0.9.12

This is a bugfix release that addresses installation problems, an occasional issue with segmentation, and a command option that had been non-functional.

Fixes

Re-enable the coverage -q/--min-mapq option. It had stopped working at some point due to a type coercion issue. (#912; thanks @rach-kennedy)
Prevent CBS segmentation failures due to nulls in input .cnr. It's not clear what causes nulls to appear in .cnr files, but when they did, segmentation failed; this happened silently in batch mode and could be difficult for users to triage. (#914, #436, #582, maybe #760, #896, #901 and nf-core/sarek#1625)
Raise max pomegranate dependency version from <=0.14.9 to <1.0.0 to avoid conflicts during installation. (#911, #890)
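
To illustrate the null-handling fix above, the sketch below drops bins whose log2 value is missing before they would reach a segmentation step. It is a minimal stand-alone example, not CNVkit's implementation: the function name drop_null_bins and the set of strings treated as null are assumptions made for illustration; only the tab-separated layout and the "log2" column name follow the .cnr format.

```python
import csv
import io
import math

# Hypothetical spellings treated as "null" in a tab-separated .cnr file.
NULL_STRINGS = {"", "na", "nan", "null", "none"}


def drop_null_bins(cnr_text):
    """Return (kept_rows, n_dropped) for tab-separated .cnr text.

    A row is dropped when its 'log2' field is empty, a null-like string,
    unparseable, or parses to a non-finite float.
    """
    kept, dropped = [], 0
    for row in csv.DictReader(io.StringIO(cnr_text), delimiter="\t"):
        raw = (row.get("log2") or "").strip()
        if raw.lower() in NULL_STRINGS:
            dropped += 1
            continue
        try:
            value = float(raw)
        except ValueError:
            dropped += 1
            continue
        if not math.isfinite(value):
            dropped += 1
            continue
        kept.append(row)
    return kept, dropped


example = (
    "chromosome\tstart\tend\tgene\tlog2\n"
    "chr1\t100\t200\tA\t0.12\n"
    "chr1\t200\t300\tA\t\n"
    "chr1\t300\t400\tB\tNaN\n"
    "chr1\t400\t500\tB\t-0.80\n"
)
rows, n_dropped = drop_null_bins(example)
print(len(rows), "bins kept;", n_dropped, "dropped")
```

Reporting the number of dropped bins, rather than discarding them quietly, is a deliberate choice here: the original problem was hard to triage precisely because it was silent.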

New Contributors

@rach-kennedy made their first contribution in https://github.com/etal/cnvkit/pull/912

Full Changelog: https://github.com/etal/cnvkit/compare/v0.9.11...v0.9.12

- Python
Published by etal over 1 year ago

py-cnvkit - Version 0.9.11

New features

Most commands include a new option, --diploid-parx-genome, to treat the pseudoautosomal regions (PAR1/2) of human chromosome X as autosomal, i.e. diploid regardless of sample sex. The value it takes is a human reference genome ID such as "grch38". This feature should help reduce false calls on sex chromosomes in human samples. (Thanks @rollf; #789)
The fix command takes a new option --smoothing-window-fraction to allow manual tuning of the smoothing window used in GC and other automatic bias corrections. (Thanks @kkchau; #859)
hg38 refFlat and genome accessibility data files are now included in the source tree. (Thanks @berguner; #822, #837)
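
The idea behind --diploid-parx-genome can be sketched as a lookup of the expected copy number for a bin in a normal sample. This is an illustration only, not CNVkit's code: the function name expected_copies is hypothetical, the PAR coordinates are the commonly cited approximate GRCh38 values and should be checked against your reference, and the sketch only handles PAR on chrX.

```python
# Approximate GRCh38 chrX pseudoautosomal regions (1-based, inclusive).
# Illustrative values only; check them against your reference genome.
PAR_X_GRCH38 = [(10_001, 2_781_479), (155_701_383, 156_030_895)]


def expected_copies(chromosome, start, end, is_male, diploid_parx=False):
    """Expected copy number of a bin in a normal (non-tumor) sample.

    With diploid_parx=True, bins fully inside PAR1/PAR2 on chrX are treated
    as autosomal, i.e. two copies regardless of sample sex.
    """
    chrom = chromosome[3:] if chromosome.startswith("chr") else chromosome
    if chrom == "Y":
        return 1 if is_male else 0
    if chrom == "X":
        in_par = any(start >= lo and end <= hi for lo, hi in PAR_X_GRCH38)
        if diploid_parx and in_par:
            return 2
        return 1 if is_male else 2
    return 2


# A male sample: a PAR1 bin is haploid by default, diploid with the option.
print(expected_copies("chrX", 1_000_000, 1_000_500, is_male=True))
print(expected_copies("chrX", 1_000_000, 1_000_500, is_male=True, diploid_parx=True))
```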

Bug fixes

The Docker image once again includes the additional scripts beyond cnvkit.py.
User-specified sample sex with -x now works properly. (Thanks @28rietd and @ccoo22; #843, #851)
User-specified smoothing window size now applies in HMM segmentation. (Thanks @zhuying412; #833, #835)
An error in export vcf has been fixed. (Thanks @pwwang; #818)

Other updates

Dependency versions are updated to match Ubuntu 23.04 Lunar, more or less.
Automated testing is done on Python versions 3.8 through 3.12 -- these are the "supported" versions.
Small documentation fixes.

New Contributors

@dependabot made their first contribution in https://github.com/etal/cnvkit/pull/791
@pwwang made their first contribution in https://github.com/etal/cnvkit/pull/818
@berguner made their first contribution in https://github.com/etal/cnvkit/pull/837
@zhuying412 made their first contribution in https://github.com/etal/cnvkit/pull/835
@kkchau made their first contribution in https://github.com/etal/cnvkit/pull/844
@28rietd made their first contribution in https://github.com/etal/cnvkit/pull/851

Full Changelog: https://github.com/etal/cnvkit/compare/v0.9.10...v0.9.11

- Python
Published by etal about 2 years ago

py-cnvkit - Version 0.9.10

This long-awaited release includes major plotting enhancements in the heatmap, scatter, and diagram commands, as well as a new export gistic command, thanks to joint work by @tetedange13 and @tskir (see below).

There are also significant infrastructure improvements including bug fixes, modernized packaging, and build/test automation.

New features

diagram:

New options --no-gene-labels to not display gene labels on the plot, and -c / --chromosome to plot a single chromosome (#628, #629, #634; thanks @tetedange13)

heatmap:

New CLI options (#35, #625, #632, #652; thanks @tetedange13 and @tskir):

--vertical: Transpose the plot, displaying the genome axis vertically instead of horizontally
--delimit-samples: Add a delimiting line between each sample row (or column, with --vertical)
--title: Set the plot title

scatter:

New option --fig-size: Set the output image dimensions (#600, #641; thanks @tetedange13 and @tskir)
Show triangles at the bottom of the plot to indicate where segments are hidden below the plotted region by automatic pruning at 'ymin=-5'. Also log a warning when this happens. (#385, #643, #645; thanks @tetedange13, @tskir, and @micknudsen)

export gistic:

New export command to generate an unsegmented "markers" file for use with GISTIC. GISTIC also takes a second input file with corresponding segments in SEG format, which CNVkit can generate with export seg. (#622, #623, #776; thanks @tetedange13, @tskir, @BioComSoftware)

API and CLI changes

Running cnvkit.py without any arguments will now display the full help text instead of an error message.
Supporting scripts (aside from cnvkit.py) are no longer installed automatically. They are still available in the source tree.

Documentation

Clarified bintest usage, provided an example, and explained outputs. (#646; thanks @tetedange13 and @tskir)

Bugfixes

Fixed several errors and warnings due to outdated usage of dependencies, e.g. pandas, pysam.
Fixed the Dockerfile and Docker image to install R packages properly for CNVkit to use internally. (#765; thanks @28rietd)
Made the Makefile example/test workflow more portable across environments. (#661, #666, #695, #699; thanks @tetedange13)
batch: Apply --drop-low-coverage option in the segmetrics step. (#694)
bintest: Include 'probes' column in .cns output so that it is valid .cns (closes #693)
fix: Condense the error message when coordinate set contains duplicate values. (#637, #638; thanks @tskir)
fix: Choose a smoothing window fraction based on the data size to help correct biases better at the extremes of the GC range, where previously some residual GC bias could still be present after correction. (#379)
BED inputs: Handle UCSC BED 'browser' header line, as used in Agilent BED files with a 2-line header. (closes #696, #618)

Internal

Modernized the packaging configuration with pyproject.toml, leaving a stub setup.py for legacy setuptools compatibility. (#790)
Set up automated testing through GitHub Actions (GHA) to verify Python versions 3.7 through 3.10 using pytest and tox. The latter makes local testing with multiple Python versions more reliable, too. (#792, #793, #794)
Updated minimum dependency versions to roughly match Ubuntu 22.04 LTS packages; these are used in CI, too.
Applied black and pylint to reformat the codebase consistently and replace deprecated calls to libraries. (#795)
Remove joblib pinning (#589, #770; thanks @DavidCain and @risicle)
Remove networkx pinning (#606, #771; thanks @DavidCain)
Make the extreme-GC filters more easily configurable via params.py (#738, #752, #753, #764; thanks @tetedange13 and @tsivaarumugam)

- Python
Published by etal over 3 years ago

py-cnvkit - Version 0.9.9

This release contains a new script and, more importantly, a volley of bug fixes by @tskir, a new CNVkit collaborator.

New script

genome_instability_index.py

For each given sample (.cnr or .cns, ideally .call.cns), this script reports two values, the number of non-neutral segments and the fraction of the total sequencing-accessible genome that they cover. Together, these values have been described as the Genome Instability Index (G2I) by Bonnet et al. (2012). These numbers are not difficult to calculate directly from .cns files, but they are frequently requested, so here you go.
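
As a rough illustration of the two numbers described above, the following sketch computes them from a list of segments. It is not the script itself: the function name genome_instability is hypothetical, "non-neutral" is taken to mean an absolute copy number different from 2 (which ignores sex chromosomes), and the total length of the supplied segments stands in for the sequencing-accessible genome size.

```python
def genome_instability(segments, neutral_cn=2):
    """Return (n_altered_segments, altered_fraction).

    `segments` is an iterable of (start, end, cn) tuples with half-open
    coordinates. The denominator is the total length of the segments given,
    used here as a stand-in for the sequencing-accessible genome size.
    """
    total = 0
    altered = 0
    count = 0
    for start, end, cn in segments:
        length = end - start
        total += length
        if cn != neutral_cn:
            altered += length
            count += 1
    fraction = altered / total if total else 0.0
    return count, fraction


segs = [(0, 1000, 2), (1000, 1500, 3), (1500, 4000, 2), (4000, 5000, 1)]
n_altered, frac = genome_instability(segs)
print(n_altered, round(frac, 3))
```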

Bug fixes by @tskir

Installation:

Set NetworkX minimum version to work with pomegranate on Python 3.9. (#614, #606; thanks @auberginekenobi)

genemetrics, diagram, scatter:

Fix an error in iterating over chromosomes during gene-wise operations or gene selection. (#580, #573, #576, #579; thanks @diushiguzhi @eriktoo @hrkemp @drmrgd @HYan-lei)

access:

Fix an error when all chromosomes listed in the exclusion BED file appear only once. (#581, #574; thanks @dajana17)

autobin:

Allow specifying explicit output filenames via -o/--output. If this option is not used, the behavior is the same as before. Some pipeline frameworks such as Snakemake require output filenames to be explicit in wrapped commands. (#608, #607; thanks @enes-ak)
Fix median-size file selection. (#613, #611; thanks @michaelsykes)

coverage:

Fix a potential crash with the -c option; generally make the -c option's results more stable. This changes the results you'd get with coverage -c compared to previous CNVkit versions, but in any case -c isn't recommended for production use, only for algorithm exploration. (#598, #593; thanks @joys8998)

genemetrics:

Rename column n_bins to probes in output, for compatibility with 'call' and 'export' commands. (#586, #585; thanks @eriktoo)

scatter:

Avoid losing short segments in rasterized PNG output, depending on DPI settings. (#615, #604; thanks @jimmy200340)
Allow NCBI-style chromosome names that contain a ".", e.g. "NC_039902.1". (#603, #602; thanks @amora197)
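
One way a region parser can accept chromosome names containing a "." is to treat everything before the first ":" as the name. The sketch below shows that approach; the regular expression and the function name parse_region are assumptions for illustration and are not taken from CNVkit's source.

```python
import re

# Chromosome name: any run of characters other than ':' or whitespace, so
# names such as "NC_039902.1" pass through unchanged. Coordinates are optional.
REGION_RE = re.compile(
    r"^(?P<chrom>[^:\s]+)(?::(?P<start>\d+)?-(?P<end>\d+)?)?$"
)


def parse_region(text):
    """Parse 'chrom', 'chrom:start-end', 'chrom:start-' or 'chrom:-end'."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized region: {text!r}")
    start = int(match["start"]) if match["start"] else None
    end = int(match["end"]) if match["end"] else None
    return match["chrom"], start, end


print(parse_region("NC_039902.1:100-2000"))
```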

segment:

Fix an IndexError during smoothing when the signal is shorter than a window, e.g. on chrY where the chromosome contains few bins. (#590, #587; thanks @tetedange13)
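
The general remedy for this class of error is to clamp the window to the signal length before smoothing. The sketch below shows the idea with a plain rolling median as a stand-in; CNVkit's own smoothing differs, and the function name rolling_median is hypothetical.

```python
from statistics import median


def rolling_median(values, window):
    """Centered rolling median that tolerates signals shorter than the window."""
    n = len(values)
    if n == 0:
        return []
    # Clamp the window to the signal length and keep it odd so it stays centered.
    window = max(1, min(window, n))
    if window % 2 == 0:
        window -= 1
    half = window // 2
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


# Three bins (as might happen on chrY) with a requested window of 11.
print(rolling_median([0.1, 0.5, 0.2], window=11))
```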

Improvements from other contributors

scripts/guess_baits.py: Fix a copy-paste error on script launch. (#588; thanks @sssimonyang)
Documentation: Link to the Debian package alongside other packages. (#562; thanks @mr-c)

- Python
Published by etal about 5 years ago

py-cnvkit - Version 0.9.8

Continuing a focus on stability and compatibility with other software:

Support for reading CRAM files with an optional user-provided local FASTA file for the reference genome sequence. (#555; thanks @johnegarza)
Call Rscript subprocess with safer flags for the R environment. Previously, --vanilla ignored R environments with the library path in a non-default location specified in the user's .Rprofile. Now, --no-restore and --no-environ ensure a clean environment but still respect the user's .Rprofile settings beyond that. (#491; thanks @pablo-gar)
Compatibility with the latest release of pandas. (#502, #523)
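
The Rscript change described above amounts to building the subprocess command with --no-restore and --no-environ in place of --vanilla. A minimal sketch follows; the helper names rscript_command and run_rscript are hypothetical, and only the flags themselves come from the release note.

```python
import subprocess


def rscript_command(script_path, rscript="Rscript"):
    """Build the argument list for running an R script in a clean session.

    --no-restore skips restoring a saved workspace and --no-environ skips
    the Renviron files, while the user's .Rprofile is still read.
    """
    return [rscript, "--no-restore", "--no-environ", script_path]


def run_rscript(script_path, rscript="Rscript"):
    """Run the script and return its standard output (requires R installed)."""
    result = subprocess.run(
        rscript_command(script_path, rscript),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


print(rscript_command("segment.R"))
```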

This release also fixes some regressions reported since the release of CNVkit 0.9.7 (which introduced a number of new performance optimizations).

scatter: A bug when plotting a region of a chromosome. (#536, #457; thanks @tskir)
scatter: An IndexError when plotting entire chromosomes, e.g. chr7. (#541, #461, #535; thanks @tskir)
fix: A bug that occurred after automatic bias corrections, introducing NaN-valued rows in place of rejected bins, leading to a downstream crash in CBS segmentation. (#551, #436, #547; thanks @johnegarza)

- Python
Published by etal about 5 years ago

py-cnvkit - Version 0.9.7

Stable release with only minor changes from the previous beta release 0.9.7.b1.

New contributions:

CRAM support: Look for and use .cram + .crai alignment and index file pairs, in addition to .bam + .bai. (#495, #434; thanks @sridhar0605)
Update Dockerfile to use Python 3 apt packages and pip3 (#493; thanks @keiranmraine)
Documentation fix (#496; thanks @rollf)

- Python
Published by etal about 6 years ago

py-cnvkit - Version v0.9.7 beta

This release contains several major enhancements particularly relevant to germline analysis. If used in production pipelines, further evaluation and benchmarking would be wise. Highlights:

Control sample clustering: To make better use of larger reference sample pools, reference --cluster will correlate the given normal samples' bin-wise coverage depths to extract clusters to be used as reference profiles. The reference .cnn file produced this way will then contain the log2 and spread summary statistics for each cluster, in addition to the global summary stats. Given this "clustered reference" profile, fix --cluster will then correlate each test sample to each clustered log2 profile in the reference to choose the most relevant control pool for normalization. The batch option --cluster will perform both these steps. Nod to Gambin lab and the authors of ExomeDepth, CoNVaDING, CLAMMS, and others for inspiration. (#308)

Calculation of bin weights has changed. This will change your segmentation results, hopefully for the better. Details below. (#429)

The batch pipeline now performs some segmentation post-processing automatically: calculating and filtering segmentation calls by 50% confidence intervals of the segment mean log2 ratios, in order to reduce false positives, followed by separate bin-level testing to detect small (e.g. exon-size) CNVs that were not caught by segmentation. The bin- and segment-level results are returned as separate .cns files; deciding whether and how to combine or use these results together is left as an exercise for the user.

We've dropped Python 2.7 support. Python version 3.5 or later is now required.

This is a beta release. Please let me know how it works for you via the Issues page. If this release contains any issues that are blocking your work, try installing one of the previous stable versions 0.9.6 or 0.9.5:

conda install cnvkit=0.9.6

Dependencies

Remove all Python 2.7 compatibility shims.
Raise minimum pandas version from 0.20.1 to 0.23.3.
Add scikit-learn (dependency of pomegranate, for HMM segmentation). Remove the older hmmlearn implementation.

Commands

batch:

Post-process segments with segmetrics (50% CI), call (filter by CI, but don't call integer copy number), and bintest.
Return bintest result as a separate, independent .cns output.
Add option '--segment-method', equivalent to segment -m.
Rename option '--method' to '--seq-method' (but '--method' still accepted for now).
Add option --cluster, passed to reference and fix if given. (#308)
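
The confidence-interval filtering step can be pictured as follows: a segment whose interval spans zero is not distinguishable from neutral. This is a simplified sketch, not the batch implementation; the function name is hypothetical, the field names ci_lo and ci_hi are assumed, and segments are only sorted into two groups here rather than merged.

```python
def split_by_ci(segments):
    """Split segments into (altered, neutral) by whether the CI spans zero.

    Each segment is a dict with 'ci_lo' and 'ci_hi' keys holding the bounds
    of the confidence interval of its mean log2 ratio.
    """
    altered, neutral = [], []
    for seg in segments:
        if seg["ci_lo"] <= 0.0 <= seg["ci_hi"]:
            neutral.append(seg)
        else:
            altered.append(seg)
    return altered, neutral


segs = [
    {"log2": 0.05, "ci_lo": -0.02, "ci_hi": 0.11},   # spans zero
    {"log2": -0.90, "ci_lo": -1.00, "ci_hi": -0.80},  # clear loss
    {"log2": 0.60, "ci_lo": 0.50, "ci_hi": 0.70},     # clear gain
]
altered, neutral = split_by_ci(segs)
print(len(altered), "altered;", len(neutral), "neutral")
```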

bintest:

New command superseding cnv_ztest.py script.
Report p-value as a column p_bintest (previously ztest) in the .cns output.
Fix probabilities for positive log2 values, i.e. gains, which previously always had p-value = 1.0. (#429)
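
For intuition on the p-value fix, compare a lower-tail probability with a two-sided one for a standardized bin value z. The sketch below shows one way a test can report p-values near 1.0 for every gain; it is an illustration of the statistics, not a description of the previous or current CNVkit code, and both function names are hypothetical.

```python
import math


def lower_tail_p(z):
    """P(Z <= z) for a standard normal; close to 1.0 for any large positive z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def two_sided_p(z):
    """P(|Z| >= |z|) for a standard normal; symmetric for gains and losses."""
    return math.erfc(abs(z) / math.sqrt(2))


for z in (-3.0, 3.0):
    print(z, lower_tail_p(z), two_sided_p(z))
```

With the lower-tail form, a strong gain (z = 3) looks entirely unremarkable, whereas the two-sided form assigns it the same small p-value as an equally strong loss.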

fix:

Change calculation of bin weights to be more consistent with 1-var meaning, with more emphasis on reference spread. It is now simpler, more consistent with import-rna, and particularly improves the accuracy of bintest. (#429)
Squeeze the range of reference-free weights
Drop bins with gc outside [.3, .7]. CLAMMS paper shows these bins carry no useful signal.
With --cluster and a clustered reference input, calculate the test sample's Pearson correlation versus each cluster's log2, and take the best one for normalization.
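
The cluster selection described for fix --cluster can be sketched as picking the reference profile with the highest Pearson correlation to the test sample. The code below is a minimal pure-Python illustration; the function names and the dictionary layout of the cluster profiles are assumptions, not CNVkit's data structures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_cluster(sample_log2, cluster_profiles):
    """Return the name of the cluster profile most correlated with the sample."""
    return max(
        cluster_profiles,
        key=lambda name: pearson(sample_log2, cluster_profiles[name]),
    )


sample = [0.1, -0.2, 0.3, 0.0]
clusters = {
    "cluster_1": [0.2, -0.4, 0.6, 0.0],
    "cluster_2": [-0.1, 0.2, -0.3, 0.0],
}
print(best_cluster(sample, clusters))
```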

reference:

With --cluster, do k-means clustering of the sample bin-level read depth correlation matrix, per Kusmirek et al. 2018. Parameter k defaults to the cube root of number of samples. Only clusters of at least 4 samples are kept for emitting summary statistics in the reference profile.
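
The two rules of thumb above (k defaults to the cube root of the sample count; clusters need at least 4 samples) can be written down directly. In this sketch the rounding to the nearest integer is an assumption, as are the function names; the clustering itself is omitted.

```python
from collections import Counter


def default_k(n_samples):
    """Default number of clusters: cube root of the sample count.

    Rounding to the nearest integer (and a floor of 1) is an assumption
    made for this sketch.
    """
    return max(1, round(n_samples ** (1 / 3)))


def clusters_to_keep(labels, min_size=4):
    """Return the set of cluster labels with at least `min_size` samples."""
    counts = Counter(labels)
    return {label for label, size in counts.items() if size >= min_size}


labels = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
print(default_k(27), clusters_to_keep(labels))
```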

segment:

hmm: Fix pomegranate-based implementation. Use iterative Savitzky-Golay smoothing with a narrow bandwidth.
Use HMM for post-TCN segmentation on VCF allele freqs
Add parameter for smoothing before CBS (thanks @EwaMarek)

segmetrics:

Add 'ttest' option for 1-sample t-test p-value.
Implement & expose --smooth-bootstrap option. For smoothing, KDE bandwidth is based on each bin's weight as a proxy for the SD of its log2 ratio values. To reduce the risk of over-smoothing on larger sample sizes, we use a loose interpretation of Silverman's Rule to reduce the bandwidth as the number of bins in a segment increases (k^-1/4).
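
A smoothed bootstrap of a segment mean, with the bandwidth shrinking as k^(-1/4) for a segment of k bins, might look like the sketch below. Several details are assumptions made for illustration: the per-bin SD proxy is taken as sqrt(1 - weight), Gaussian noise is used as the kernel, and all function names are hypothetical.

```python
import math
import random


def shrink_factor(k):
    """Bandwidth multiplier that decreases with the number of bins, k^(-1/4)."""
    return k ** -0.25


def smoothed_bootstrap_means(log2, weights, n_boot=100, seed=0):
    """Bootstrap the segment mean, jittering each resampled bin value.

    The per-bin standard deviation is approximated as sqrt(1 - weight),
    an assumption for this sketch, then scaled by k^(-1/4).
    """
    k = len(log2)
    if k == 0:
        raise ValueError("segment has no bins")
    rng = random.Random(seed)
    shrink = shrink_factor(k)
    bandwidths = [math.sqrt(max(0.0, 1.0 - w)) * shrink for w in weights]
    means = []
    for _ in range(n_boot):
        total = 0.0
        for _ in range(k):
            i = rng.randrange(k)  # resample a bin with replacement
            total += rng.gauss(log2[i], bandwidths[i])  # add kernel noise
        means.append(total / k)
    return means


values = [0.42, 0.51, 0.38, 0.47]
boot = smoothed_bootstrap_means(values, weights=[0.9, 0.8, 0.95, 0.85])
print(len(boot), min(boot), max(boot))
```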

API

do_heatmap: Add 'ax' parameter (thanks @fbrundu)
CNA.residuals(): speed; keep index intact in returned pd.Series
smoothing: Linearly roll-off weights in mirrored wings. Affects CNA.smoothed() / savgol, but not rolling median bias correction.
Rename CNA.smoothed() to CNA.smooth_log2(), since it returns the smoothed log2 values, not a new/altered CNA.

Bug fixes

batch: Fix argparse formatting issue (#466)
import-rna: Fix a regression in reading 2-column per-gene counts (-f counts).
reference: Fix sex inference/usage when creating haploid-x reference (#459; thanks @duartemolha)
scatter: Use a safe matplotlib backend on OS X to avoid crash
VariantArray: Fix/streamline indexing of variants by bin/segment

- Python
Published by etal over 6 years ago

py-cnvkit - Version 0.9.6

Essential maintenance and bug fixes, for the most part. Some key dependencies have changed, though this should be generally painless for you, and one or two regressions introduced by recent optimizations have been fixed.

This will be the last CNVkit version to run on Python 2.7. The next major release of pandas (0.25.0) will remove support for Python 2.7, and once that happens it will become increasingly difficult to install future versions of CNVkit on Python 2.7 -- so we're not going to try.

The segmentation method flasso depends on the R package cghFLasso, which is unmaintained and has been removed from CRAN. For now, segment -m flasso is still supported if you already have cghFLasso installed. But given the above, flasso will be removed from the next CNVkit version in favor of the HMM-based methods.

Dependencies

Raised minimum pandas version from 0.18.1 to 0.20.1, and support up to 0.24.2, resolving some warnings and an error in pandas 0.22+. (#413; thanks @chapmanb)
The soft dependency on hmmlearn is replaced with an explicit dependency on pomegranate for the HMM-based segmentation methods. This dependency will now be pulled in automatically when installing via pip or conda.
The R package cghFLasso has been removed from CRAN, and therefore is no longer a dependency of CNVkit and will not be installed automatically through the standard conda installation method. (#419)

Commands

antitarget:

Be more specific in removing noncanonical chromosomes (e.g. alternate contigs, mitochondria) from the binned regions. This avoids skipping chromosomes of interest in some non-human genomes with non-numeric contig names, like yeast. (#388; credit for regexes to @brentp)

coverage:

With --count-reads, use query aligned length to handle soft-clipped reads properly. Now the results with and without this option should be similar. (#411; thanks @desnar)
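
To see why the aligned length matters for soft-clipped reads, the sketch below derives both lengths from a CIGAR string using only the standard library. It is an illustration of the concept rather than CNVkit's code, which works with pysam read objects; the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume query bases and are part of the alignment itself.
ALIGNED_QUERY_OPS = set("MI=X")
# The same set plus soft clips, which are in the read but not aligned.
ALL_QUERY_OPS = ALIGNED_QUERY_OPS | {"S"}


def query_aligned_length(cigar):
    """Number of read bases that are actually aligned (soft clips excluded)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in ALIGNED_QUERY_OPS)


def full_query_length(cigar):
    """Number of read bases stored in the record (soft clips included)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in ALL_QUERY_OPS)


cigar = "10S80M10S"
print(query_aligned_length(cigar), full_query_length(cigar))
```

For a read with 10 soft-clipped bases at each end, counting the full read length would overstate the bases covering the bin by 25%.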

segment:

For -m flasso, partition array by chromosome to avoid edge effects. (#409, #412; thanks @giladmishne)
Removed the deprecated option --rlibpath; use --rscript-path instead.
HMM implementations have changed, and results may be different now. Note that the HMM methods are still provisional. A stable, supported version of these methods will be provided in the next CNVkit release.

Python API

do_scatter now returns a figure (#408; thanks @jeremy9959)

Bug fixes

scatter: Whole chromosomes can once again be specified with -c. (In the previous release, a chromosome without coordinates would cause an IndexError.) (#393)
import-rna: Option --max-log2 can now be specified by users. (Previously, only the default value of +3.0 worked.)
VCF I/O (skgenome.tabio): Support GATK 4's VCF files that contain records with empty ALT alleles, substituting zero if ALT AD is missing. (#391; thanks @chapmanb)
Due to a certain versioning-dependent interaction between numpy, pandas, cython, and conda, CNVkit may have printed spurious RuntimeWarning messages which could be safely ignored. The current release attempts to silence these messages if they occur. (#390)

- Python
Published by etal over 7 years ago

py-cnvkit - Version 0.9.5

Minor bugfix and usability improvement.

autobin:

Ensure targets are non-empty and match BAM chrom names (#371)

segment:

Suppress help text for deprecated --rlibpath (#317)
Fix help text display (#380)

- Python
Published by etal almost 8 years ago

py-cnvkit - Version 0.9.4

Performance improvements and bug fixes. Improved automated testing (#254) and documentation (#334).

Optimized performance of selecting genomic intervals, in particular speeding up call, segment, and segmetrics for whole genome and exome datasets. (#340, #346)

Added script snpfilter.sh to help create T/N VCFs suitable for use with CNVkit. (#364)

Commands

batch, segment:

Add option --rscript-path to specify the preferred Rscript installation to use in case it is not in the default path. Deprecate the similar option --rlibpath. (#317, #321, #322; thanks @MajoroMask and @chapmanb)

reference:

Only print the rejected targets if there are fewer than 500 of them; otherwise, just print the number that were rejected. (#354)

segment:

Tighten 'flasso' p-value threshold from .005 to .0001. The more lenient threshold had led to over-segmentation.

segmetrics:

Optimize bootstrapping procedure for ~10x speedup and lower memory usage. (#346)

call:

Add option --drop-low-coverage, matching the other commands.

import-rna:

Implement -n/--normal option. (#362)
Add --max-log2 option, default +3.0.
Add options --no-gc, --no-txlen to disable bias corrections.

export bed:

Add option --label-genes. By default, the 4th column is filled with the sample ID, which is undesirable if only one sample (.cns file) is being exported to BED. This option keeps the gene labels.

Python API

Changed default intersection mode from 'inner' to 'outer'. For the CNVkit command line operations this shouldn't have a visible effect.
BED file parser handles (i.e. skips) initial "browser position" line.
Add method GenomicArray.iter_ranges_of() to iterate over intervals retrieving values of a specified column, without copying chunks of the entire GenomicArray table.
Add method GenomicArray.intersection() (#340)
tabio: Add 'vcf-simple' and 'vcf-sites' reader formats (WIP; #231)

Bug fixes

scatter: Avoid an error in smoothing (#369; thanks @mpschr)
sex: Don't crash if chrX or chrY is missing; just print "NA"
import-rna: Avoid a crash if -n is not used.
Script cnv_expression_correlate.py: Avoid a crash on Py3
Script cnv_annotate.py: Fix command-line option parsing (#367)

- Python
Published by etal almost 8 years ago

py-cnvkit - Version 0.9.3

A quick bugfix release to fix a potential crash in the segmetrics command (#325).

- Python
Published by etal about 8 years ago

py-cnvkit - Version 0.9.2

This release contains a new command import-rna to infer coarse-grained copy number from RNA expression data. (#151)

Three new HMM-based segmentation methods are offered: 'hmm', 'hmm-germline', and 'hmm-tumor'. These should be considered experimental and used with caution; the implementations are likely to change in the next release.

The option --male-reference in the commands batch, reference, fix, call, and export (at least) has been renamed to --haploid-x-reference everywhere to reduce user confusion. A shim is in place so --male-reference will continue to work.

Documentation, logging, and some error messages are improved.

Thanks to @chapmanb, @MajoroMask, and others for contributing to this release.

Dependencies

'pandas' version 0.22 is supported.
'pysam' version 0.13.0 is supported.
'hmmlearn' version 0.2 is a run-time requirement to use the new HMM-based segmentation methods. The rest of CNVkit can be run without it. To ensure the right version is installed, install CNVkit with conda as usual, then install hmmlearn with pip within the CNVkit conda environment.
Assume and require pip/setuptools for installation. (This is included with stock Python 2.7 and later.)

Scripts

New script "skg_convert.py" to convert between BED, GATK interval list, GFF, VCF, and tabular formats using the 'skgenome.tabio' sub-package, with options for simple post-processing.
Removed the deprecated script refFlat2bed.py. (Use skg_convert.py instead.)

Commands

access:

Drop noncanonical, untargeted contigs/chromosomes by default. This affects analyses run from scratch with batch, too. (#169, #299)

segment:

Three new methods can be specified with -m: hmm, hmm-germline, and hmm-tumor.
With -m flasso, force a breakpoint at centromeres, as was already done for the default 'cbs' method.

reference:

The option --antitargets is no longer required to build a flat reference. Previously, building a flat reference for WGS or TAS required creating an empty file to use as antitargets alongside the target BED.
Print a warning if the sample sex inferred from targets does not match that of antitargets. (#281)

scatter:

Removed the deprecated, invisible option --background-marker. (Use --antitarget-marker instead.)
Trendlines should reflect small CNVs better, while preserving overall smoothing. The implementation now uses the Savitzky-Golay method instead of a Kaiser window, and the smoothing bandwidth is better-tuned. (This can also slightly improve outlier filtering in segment.)

export seg:

Add option --enumerate-chroms to replace chromosome or contig names with sequential integers. Previously, this renumbering was always done, following some version of the SEG format. But since most tools don't require the contigs to be sequential integers, and this behavior causes trouble for users, it's now disabled by default. (#282)
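
The renumbering that --enumerate-chroms turns on can be sketched as a mapping from contig names to sequential integers. Numbering by order of first appearance is an assumption of this sketch, and the function name is hypothetical.

```python
def enumerate_chroms(chrom_names):
    """Map each chromosome name to a sequential integer, in order of appearance."""
    mapping = {}
    for name in chrom_names:
        if name not in mapping:
            mapping[name] = len(mapping) + 1
    return mapping


segment_chroms = ["chr1", "chr1", "chr2", "chrX", "chrX"]
mapping = enumerate_chroms(segment_chroms)
renumbered = [mapping[name] for name in segment_chroms]
print(mapping, renumbered)
```

The example also hints at why the behavior is now opt-in: after renumbering, "chrX" becomes 3, which a downstream tool could easily mistake for chromosome 3.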

gainloss/genemetrics:

Rename gainloss command to genemetrics. A shim is in place so cnvkit.py gainloss will continue to work. (#278)
Report segment- and bin-level weight and probes separately. (#107, #278)

Bug fixes

autobin: Require -g/--access for WGS (#289)
batch: Use the "access" regions for the WGS workflow to choose bin size; these were previously being ignored, so bin sizes were too large, being based on the size of the whole genome, not just sequencing-accessible regions.
call: Safely handle bins with zero weight when running call --filter cn. (chapmanb/bcbio-nextgen#2112; thanks @chapmanb)
coverage, guess_baits.py: Handle input BED files containing >4 columns. (#301)
gainloss: Without -s, make 'depth' the weighted mean of bins, not just the first bin's value.
segment: Ensure the .cns output file's columns are sorted properly (#291)
vcfio: Don't crash if a record has no ALT values (#279)
tabio:
- Recognize BED format with decimal in chromosome name (#293)
- Improvements to GFF/GTF/GFF3 parsing. The new options are mostly accessible through the Python API and the script 'skg_convert.py'. (#311)
- In 'read_auto' (and all CNVkit commands that take regions as input), determine the file format first by checking the file extension and verifying the format of the first(-ish) line. Only if that doesn't work, fall back to the original method of testing the first(-ish) line against a brittle series of regular expressions. (#315)

Python API

cnvlib.write: Newly available at the top level to write tabular files (like .cnr and .cns), symmetric with 'cnvlib.read()'. The 'cnvlib.tabio' alias to 'skgenome.tabio' has been removed; to read and write formats other than TSV-with-header ('tab'), import and use 'skgenome.tabio' directly.
CopyNumArray.squash_genes: remove deprecated keyword argument 'squash_background'. Use 'squash_antitarget' instead.
segmetrics: Move the functions supporting this command from 'cnvlib.command' to a new module 'cnvlib.segmetrics'.

- Python
Published by etal over 8 years ago

py-cnvkit - Version 0.9.1

Highlights: Useful enhancements and changes to plotting and segmentation, and a new script for single-exon CNV testing.

Plus, bug fixes and usability improvements to avoid unexpected errors. (#250, #255, #262, etc.)

Dependencies

Compatible with the most recent pandas version 0.21.0 (#273, #274; thanks @chapmanb)
R dependencies were reduced to simplify installation.

Scripts

Renamed "cnn*.py" to "cnv*.py".
New script "cnv_ztest.py" to detect single-bin (e.g. single exon) deep deletions and high-level amplifications.
In "cnv_updater.py", rename "Background" (i.e. off-target) bins to "Antitarget", in addition to adding a "depth" column if it's missing.

Commands

autobin:

Raise the maximum target/antitarget bin sizes to 50kb/1Mb.

fix:

Allow specifying sample_id via --sample-id/-id, in case the input coverage filenames do not have the expected form "sample_id.targetcoverage.cnn" and "sample_id.antitargetcoverage.cnn". (#269; thanks @chapmanb)

segment:

Process each chromosome arm separately (with 'cbs' and 'haar', but not 'flasso'). Centromere locations are guessed from the largest gap between sequencing-accessible regions, and are not necessarily the true locations, although they do match fairly well on the human genome.
Logging of dropped bins is streamlined somewhat.
New method -m none to only calculate arm-level segment means (for testing and experimentation).
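
The centromere guess described above (the largest gap between sequencing-accessible regions) can be sketched in a few lines. This illustrates the heuristic only and is not CNVkit's implementation; the function name guess_centromere and the tuple layout are assumptions.

```python
def guess_centromere(regions):
    """Return (gap_start, gap_end) of the largest gap between regions, or None.

    `regions` is a list of (start, end) accessible intervals on a single
    chromosome, sorted by start coordinate and non-overlapping.
    """
    best = None
    for (_, prev_end), (next_start, _) in zip(regions, regions[1:]):
        gap = next_start - prev_end
        if gap > 0 and (best is None or gap > best[1] - best[0]):
            best = (prev_end, next_start)
    return best


accessible = [(0, 100), (150, 1000), (5000, 9000), (9100, 9500)]
print(guess_centromere(accessible))
```

Splitting the chromosome at the returned gap yields the two arms that are then segmented separately.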

scatter:

Highlight non-neutral segments from .call.cns. If segments have the columns 'cn' and potentially also 'cn1' and 'cn2' (as added by the call command), use those fields to display copy number alterations, LOH and allelic imbalance with colorized segments (orange by default), and use gray for neutral segments. If a VCF is also given, the same is done for SNVs in the lower panel. Otherwise, all segments are colorized as before. (#18, #157)
New option --by-bins to display x-axis positions by sequential bin number on each chromosome, rather than genomic coordinates. This makes the plots much more useful with targeted amplicon sequencing data, or very small gene panels. (#63)
Trend line (--trend) now accounts for bin weights, which generally results in a better fit.
Improved interaction of -c and -g options:
- Only apply the window margin (-w) if -g is used alone, or -c specifies a small chromosomal region with no genes.
- Allow an empty gene list (-g '' or -g ',') to prevent highlighting and labeling of any genes / small non-genic "Selection" in the -c region.
- If any gene in -g is not fully within the region specified by -c, name that gene and its coordinates in the error message.
- If the -c region has size <=0, show a specific error message.
- Handle NaN log2 values when calculating y-axis limits.

heatmap:

Incorporate the --by-bins argument to match scatter. (#63)
Warn if selected region contains no data for a sample. This helps troubleshoot if a chromosome name was mis-specified on the command line. (#268)

export seg:

Change column headers to match DNAcopy output. The column headers generally don't matter in the SEG format, but the DNAcopy dataframe is considered the canonical form.

Python API

cnvlib.do_segment -- new keyword argument min_weight to drop bins with 'weight' below the specified value. If not used, then only bins with weight 0 will be dropped. This feature is not recommended for normal usage and is not available on the command line.
cnvlib.do_scatter -- Remove deprecated keyword argument 'background_marker' in favor of 'antitarget_marker', corresponding to scatter options deprecated in v0.9.0.
cnvlib.cnary.CopyNumArray: Add method 'smoothed', which calculates the trendline displayed by the scatter command.
skgenome.tabio: Add read support for samtools 'dict' format, which resembles the plain-text SAM header and can contain chromosome names and sizes.
skgenome.gary.GenomicArray: Add magic methods __bool__ (Py3) and __nonzero__ (Py2) to ensure an empty GenomicArray, i.e. 0 rows, is treated as false-ish on both Python 2.7 and 3.x.

- Python
Published by etal over 8 years ago

py-cnvkit - Version 0.9.0

In addition to bug fixes, documentation updates, and usability improvements, this release includes some larger changes:

The off-target bins in .cnn and .cnr files are now assigned the label "Antitarget" instead of "Background" in the "gene" column.

The label "Background" in existing files will still be handled the same way, but new output files generated with CNVkit 0.9.0 and later will use the "Antitarget" label -- so, earlier versions of CNVkit may have problems with files produced by CNVkit 0.9.0. Some command line options and API keyword arguments similarly replace "background" with "antitarget", with shims in place for compatibility with existing scripts. (#171)

The sub-packages 'genome' and 'tabio' are now in a separate top-level package 'skgenome', still included in the CNVkit distribution. (See "Python API" below.)

This does not affect the command-line usage of CNVkit, but clears the way to extract a scikit-genome package that can be installed and used separately from CNVkit for computing with genomic intervals.

Documentation

Link to an example VCF file that contains matched tumor and normal samples and will work nicely with CNVkit.
Describe the breaks command's output columns. (#220)
Show a Python code example customizing a plot with matplotlib.pyplot. (#196)

Dependencies

pysam: Raise minimum to 0.10; support new version 0.11.2.1 (#218; thanks @chapmanb)
pandas: Support new version 0.20.1 (#215)
numpy: Support new version 1.13 (#235, #238)

Commands

batch:

Log the CNVkit version number at the start of the run.
Print a message at the end if no tumor/test samples were specified. (#214)
Clarify error messages for bad option combinations. (#216)
Removed the deprecated, suppressed/invisible option --split. It was a shim in the 0.8 series to support old scripts.

reference:

Ensure the inferred chromosomal sex matches between the targets and antitargets for the same sample. If the inferences do not match, prefer antitargets. (#234, #237)

fix:

Warn & don't reweight bins if most antitargets have no/low coverage. This avoids a variety of surprising downstream problems when the input was specified as hybrid capture (the default), but is actually from targeted amplicon sequencing, or otherwise has no reads mapped to most off-target bins.

segment:

Log the segmentation method and p-value/q-value threshold.

call:

Add option --center-at, for re-centering log2 values at a user-specified neutral value.
The option --center can be used without an argument, in which case it uses the default centering method 'median'.

diagram:

New option --title to add a custom title to the top of the generated figure. (#239; thanks @micknudsen)

export vcf:

When given a .cnr file corresponding to the usual segmented input file (.cns), emit the CIPOS and CIEND tags in the generated VCF. These indicate the "fuzzy" coordinates of segment breakpoints. Here, the ranges are simply the widths of the underlying bins adjacent to each segment breakpoint. These tags can help meta-methods aggregate/harmonize CNVkit's calls with those of other structural variant callers. (#72)

import-picard:

Don't accept directory as an argument (was deprecated).
Be a little more flexible in filenames accepted: instead of requiring input files to be named *.targetcoverage.??? or *.antitargetcoverage.???, strip the full suffix and default to 'targetcoverage.cnn' output suffix, or 'antitargetcoverage.cnn' if input filename contains 'antitarget'. Works the same for filenames following the earlier convention, but now is pretty safe for amplicon targets with arbitrary filenames, and behavior is generally less surprising.

Bug fixes

antitarget: Don't crash if -g/--access is not given (#207)
batch: Don't crash in 'wgs' mode when given just targets (-t) without a FASTA reference genome sequence (-f).
call --filter ampdel: Drop segments with copy number (cn field) between 0 and 5, exclusive, as the documentation indicates. Previously, it was just merging adjacent segments with copy number 1--4, but not dropping them. (#222)
export cdt: Match the CDT spec. Fix a regression in which columns could be swapped/misaligned versus the header. Add a dummy "EWEIGHT" row to ensure Java TreeView starts reading data from the correct line in the file.
export theta: Don't crash on bins where reference is NaN. (#168)
metrics, descriptives: Handle degenerate/trivial cases consistently. (#202)
segment: Handle sample names that are integers with leading zeros. (#213)
sex: Don't crash if chromosomes X and Y are both missing. (#236)
VCF parsing (call, scatter, segment):
- Safely handle small or empty VCF files that previously could trigger a crash during BAF calculation. Now, with an empty VCF an all-blank "baf" will be emitted. (#218, #224; thanks @chapmanb)
- Improve handling of Mutect2 VCF files, somewhat. Mutect2 VCFs are still not recommended as input to CNVkit; try FreeBayes or GATK HaplotypeCaller instead. (#195)
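The corrected `call --filter ampdel` behavior listed above can be sketched as a simple selection rule. This is an illustration only: segments are modeled as plain dicts with a "cn" key, which is an assumption and not CNVkit's internal representation.

```python
def ampdel_filter(segments):
    """Keep deep deletions (cn == 0) and high-level amplifications (cn >= 5).

    Segments with copy number strictly between 0 and 5 are dropped.
    """
    return [seg for seg in segments if seg["cn"] == 0 or seg["cn"] >= 5]


segments = [
    {"chromosome": "chr1", "start": 0, "end": 1000, "cn": 2},
    {"chromosome": "chr1", "start": 1000, "end": 2000, "cn": 0},
    {"chromosome": "chr1", "start": 2000, "end": 3000, "cn": 4},
    {"chromosome": "chr1", "start": 3000, "end": 4000, "cn": 7},
]
kept = ampdel_filter(segments)
```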

Python API

Moved sub-packages 'genome' and 'tabio' to separate top-level package 'skgenome' (#201). The top-level cnvlib API is mostly the same otherwise, but supporting modules were refactored to decouple skgenome from cnvlib and remove redundancies. In particular:

Split module cnvlib.core into skgenome.tabio and cnvlib.cmdutil
Remove GenomicArray static method row2label in favor of functions to_label and from_label in new module skgenome.rangelabel.
The SEG writer in 'tabio' now replaces chromosome names with 1-based integer indices, per SEG spec/convention. The export seg command now uses this writer directly.

Scripts

Remove the script coverage_bin_size.py, previously deprecated in favor of the autobin command.
Add skg_convert.py to convert between tabular formats (including BED and UCSC RefFlat).
Deprecate refFlat2bed.py in favor of skg_convert.py.
Add cnn_annotate.py to replace the "gene" field for each bin in a .cnn or .cnr file, given a gene annotation database like refFlat.txt. The need for this comes up occasionally when users notice at the end of an analysis that vendor-annotated targets are not the desired gene names.

- Python
Published by etal almost 9 years ago

py-cnvkit - Version 0.8.5

New 'autobin' command, replacing the script coverage_bin_size.py. Fix some bugs and usability issues. Unit tests improved, especially for the 'cnvlib.genome' sub-package.

Dependencies

Pandas 0.18.1 is once again supported. Previously the minimum version was 0.19.1. (chapmanb/bcbio-nextgen#1836)
Pysam minimum version is still 0.9.1.4, but slightly older versions in the 0.9 series may still work too. (#192)

Commands

autobin:

New command, replacing and extending the script coverage_bin_size.py. The script is still included (and shares most of the same code), but is considered deprecated and will be removed in the 0.9.0 release. (#170)
In 'amplicon' and 'hybrid' modes, ensure the regions sampled for coverage are the same in every run by setting the random seed. (#191)

antitarget, autobin, batch:

Fix an issue in GenomicArray.subtract() that caused some of the expected output regions to be missing. In cases where this caused an entire chromosome to be lost, the coverage_bin_size.py script and the autobin and batch commands in hybrid mode would crash. (chapmanb/bcbio-nextgen#1799)

batch, diagram:

Fix creation of chromosomal diagrams with --diagram and the diagram command. (#190)

export:

In export seg, use 1-based indexing in the SEG output. (#197)
Fix export cdt format; it was generating Java TreeView (jtv) earlier.

- Python
Published by etal over 9 years ago

py-cnvkit - Version 0.8.4

This minor release focuses on improving usability and fixing some bugs.

Documentation is updated (thanks @kyleabeauchamp for #186).

Dependencies

Raise minimum pandas version from 0.18.1 to 0.19.0
Raise minimum matplotlib version to 1.3.1

Commands

fix, metrics: - Set PRNG seed to ensure reproducible results. The pipeline is now fully repeatable with identical results if run in serial, i.e. without -p.

fix, reference: - Reduce boundary effects (expected log2 and spread values of 0 in some bins) when smoothing biases on very small gene panels, e.g. targeted amplicon sequencing of <5 genes, <100 bins. (#181)

fix: - Don't complain about mismatched sample IDs if antitargets are blank. This allows reusing a blank "MT" file in a shell loop for WGS and amplicon data.

reference: - Make antitargets (antitarget.bed or *.antitargetcoverage.cnn) an optional argument. Previously this argument was required, so processing WGS or amplicon data, which has no off-target regions or reads, required the user to create and provide a blank BED file or appropriately named, empty .cnn files. (#183)

segment: - Don't log "Dropped 0 low-coverage bins". Only log when it actually drops bins.

diagram, heatmap: - Add option --no-shift-xy. Shifting X and Y according to reference and sample sex was done in diagram, but not heatmap. Now it's optional in both.

heatmap:
- Add a legend of log2 ratio colors to the plot. (#36)
- Add options -x/--sample-sex and -y/--male-reference. (#172)

gender/sex:
- Rename 'gender' command to 'sex', with shim for backward compatibility. (#182)
- In other commands, the -g/--gender argument is renamed to -x/--sample-sex, also with a compatibility shim. Argument values x and y are accepted in addition to f/female and m/male, respectively.

import-picard: - Deprecate searching a directory tree for files. It was a vestige of early lab work, and makes a shaky assumption about Picard CalculateHsMetrics --PER_TARGET_COVERAGE output filenames.

API

The do_* function implementations moved to their named modules. The do_* functions can still be called or imported from the cnvlib and cnvlib.commands modules.
All parsing and serialization of "chr:start-end" genomic region labels is consolidated under a new module, cnvlib.genome.rangelabel. These functions are used in tabio.textcoord, GenomicArray.labels(), and elsewhere to ensure consistent behavior.

Internal

cnvlib.genome: Handle nested bins correctly in the merge, flatten, and intersect modules, functions and GenomicArray methods. Verified with thorough unit tests.
VCF: If the paired normal sample's genotypes are all 0/0 or missing, fall back to --zygosity-freq (inference from b-allele frequency) rather than marking all variants as somatic. Then infer and drop additional somatic SNVs based on genotype after parsing, and only if that wouldn't drop all records. This allows CNVkit to safely distinguish somatic vs. germline in VCFs from Mutect2, though Mutect2 is still not recommended. (#184)

- Python
Published by etal over 9 years ago

py-cnvkit - Version 0.8.3

Bug fixes and a few usability improvements. Notably, for the whole-genome sequencing workflow (batch -m wgs), bin size is now inferred from a sample's genome-wide coverage depth instead of using a fixed value, which should yield better results by default.

Dependencies

scipy: Raise minimum version to 0.15 (for the function scipy.stats.median_test)

New scripts

coverage_bin_size.py: Quickly estimate on- and off-target read depths to suggest reasonable bin sizes to use with the target and antitarget or batch commands. (#170)
guess_baits.py: In case the baited regions for a target capture panel are not known, use sample BAM files from sequencing with that panel to infer the likely captured regions. Works either guided, given a list of potential targets (e.g. all exons in a genome), or unguided, scanning all sequencing-accessible bases in the genome to find areas with elevated coverage.

Both scripts are preliminary and may be removed in a future release.

Global changes

Infer read lengths automatically from the given sample BAM files where needed (coverage and batch). Remove the hard-coded parameter cnvlib.params.READ_LEN. (#74)
Handle VCFs generated by LoFreq. This program does not emit sample genotypes, but locus depths and allele frequencies can be found in the INFO column instead -- unusual but technically within the VCF spec. (#173)

Commands

batch, coverage, segment: - The option -p/--processes can now be used without an argument to specify parallelizing across all available CPUs. The now-optional argument value is the maximum number of CPUs to use; the special value -p 0 was previously used to specify all CPUs (this still works).
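An option whose value is optional, as described for -p/--processes, can be expressed with argparse's `nargs="?"` and `const`. The sketch below is not CNVkit's actual parser; the default of 1 (serial) and the helper name `resolve_processes` are assumptions made for illustration.

```python
import argparse
import multiprocessing


def build_parser():
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument(
        "-p", "--processes",
        nargs="?",   # the value after -p is optional
        type=int,
        const=0,     # used when -p is given with no value
        default=1,   # used when -p is absent (assumed serial default)
    )
    return parser


def resolve_processes(requested):
    """Treat 0 (or less) as a request for all available CPUs."""
    if requested < 1:
        return multiprocessing.cpu_count()
    return requested
```

With this setup, a bare `-p` and `-p 0` both resolve to all CPUs, while `-p 4` caps the worker count at 4.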

batch: - Automatically estimate a reasonable average bin size in the whole-genome workflow, -m wgs, using a fast estimate of a given normal/control sample's genome-wide average coverage depth. (If multiple normals are given, the median-sized sample is used for this calculation.) This allows CNVkit to handle low-coverage/low-pass WGS data better by default. (#170)

coverage: - With --count, count all reads that overlap a region, but trim any portions of each read aligned outside the region from the number of bases counted. The result should now be closer to that without --count.
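The trimming rule for --count can be sketched as clipping each read's aligned span to the region before summing bases. Reads are modeled here as hypothetical (start, end) half-open intervals; the real command works on BAM alignments.

```python
def overlap_bases(read_start, read_end, region_start, region_end):
    """Number of a read's aligned bases that fall inside the region."""
    return max(0, min(read_end, region_end) - max(read_start, region_start))


def count_bases(reads, region_start, region_end):
    """Sum the in-region bases over all reads; non-overlapping reads add 0."""
    return sum(
        overlap_bases(start, end, region_start, region_end)
        for start, end in reads
    )


reads = [(90, 190), (150, 250), (300, 400)]
total = count_bases(reads, 100, 200)
```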

scatter: - In chromosome-level plots, the displayed x-axis range now matches the specified region (via -c or -g + -w) exactly. Previously, the displayed range depended on the bin locations. (#180)

Bug fixes

antitarget: Handle empty off-target regions safely. (chapmanb/bcbio-nextgen#1696)
export theta: Rename argument --min-depth to --min-variant-depth, matching the equivalent argument in other commands. (#178; thanks @myronpeto)
scatter: Warn, don't crash, if a region in --region-list covers no bins. (#174; thanks @gabeng)

API changes

New module cnvlib.samutil for convenience functions on BAM files, using pysam.
New module cnvlib.autobin supporting the script coverage_bin_size.py. (#170)
Removed sub-package cnvlib.ngfrills, moving most functionality to samutil and tabio.
genome.GenomicArray: New method total_range_size, similar to pybedtools total_coverage()

- Python
Published by etal over 9 years ago

py-cnvkit - Version 0.8.2

This release covers a number of internal changes to improve the stability and consistency of CNVkit, as well as new and improved command options to make more features available from the command line.

Due to a slight change in the binning procedure (see target and antitarget below), newly generated target and antitarget BED files, or a reference generated with batch, may not use the same bin boundaries as earlier versions. CNVkit will check these files for consistency and alert you if your BED or .cnn files do not match because of this change, e.g. running batch from scratch with the same panel but with two different CNVkit versions. If you want to update CNVkit mid-project, either keep using the same reference.cnn file as before for all new samples (as always), or regenerate all your *.targetcoverage.cnn and *.antitargetcoverage.cnn files to build a new reference.

Dependencies

pyvcf: No longer needed. Instead, parse VCFs with pysam, which is noticeably faster and better able to handle newer VCF and gVCF features. (#159)
pysam: Raise minimum version to 0.9.1.4.

Global changes

When extracting a sample ID from a filename, instead of trimming everything after the first '.' character, only drop known or single-part extensions. For example, "Case1.exome.tumor.bam" and "Case1.exome.tumor.vcf.gz" will now resolve to the sample ID "Case1.exome.tumor" instead of "Case1". Output files will be named like "Case1.exome.tumor.cnr" instead of "Case1.cnr", avoiding potential naming conflicts in the batch command when processing multiple samples. (#48)
Always sort regions by genomic coordinates after reading a file. This doesn't modify the input file in-place, but ensures the output files are always sorted the same way.
Gender detection is more robust. It now uses Mood's median test instead of the Mann-Whitney rank test. As a fallback for edge cases, e.g. only one segment per chromosome, it compares difference of weighted medians in autosomes versus sex chromosomes.
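The sample ID rule in the first item above (drop only known or single-part extensions) might look like the following sketch. The `KNOWN_EXTENSIONS` tuple and the function name are assumptions for illustration; the list CNVkit actually recognizes may differ.

```python
import os

# Assumed multi-part extensions to strip as a unit (illustrative only)
KNOWN_EXTENSIONS = (
    ".vcf.gz",
    ".bed.gz",
    ".targetcoverage.cnn",
    ".antitargetcoverage.cnn",
)


def sample_id_from_filename(path):
    """Derive a sample ID without truncating at the first '.' character."""
    name = os.path.basename(path)
    for ext in KNOWN_EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)]
    # Otherwise drop only the final, single-part extension
    return os.path.splitext(name)[0]
```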

VCF parsing:
- Improve handling of VCFs from Mutect2 (#122, #153) and bcftools (#146).
- Don't reject records where FILTER is 'PASS' or '.'.
- VCF options are now consistent across the commands that can use them (call, scatter, segment, export theta and export nexus-ogt).
- New VCF option -z/--zygosity-freq to override VCF genotype calls. (#153, #132)

Commands

target, antitarget: - Divide bins evenly, using the same internal mechanism (the new GenomicArray.subdivide() method). Previously, subdivided regions were not always equal-sized as they should have been. Now, the coordinates of newly generated targets from a baits BED file may be a little different than before.
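Even subdivision of a region can be sketched as below. This is not the GenomicArray.subdivide() implementation; the function signature and the rounding rule for the number of bins are assumptions chosen to show the idea of equal-sized bins.

```python
def subdivide(start, end, avg_size):
    """Split [start, end) into bins of near-equal width close to avg_size."""
    span = end - start
    n_bins = max(1, int(round(span / float(avg_size))))
    # Integer boundaries spaced as evenly as possible; widths differ by at most 1
    edges = [start + (span * i) // n_bins for i in range(n_bins + 1)]
    return list(zip(edges[:-1], edges[1:]))


bins = subdivide(100, 1100, 300)
```

Here a 1000-base region with a 300-base target size is split into three bins of 333, 333 and 334 bases rather than three 300-base bins plus a 100-base remainder.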

target:
- Drop zero-width bins (#167).
- Improve assignment of gene names to targets in WGS datasets. (#164)
- Accept any supported region format for --annotate, including BED, interval list and GFF, in addition to the already supported UCSC refFlat. The format is detected automatically. (#163)
- Raise an error if the given annotations file (refFlat or equivalent) and the given baited/targeted intervals do not have any overlapping chromosomes.

antitarget: - Set the default average bin size to 150kb. Previously, the CLI default was 200kb, but the API default was 100kb; experience shows 150kb works well.

access: - Avoid a possible error when more than 1000 small regions are excluded from a single sequencing-accessible region. (#150)

coverage:
- Fix a unicode vs. bytes incompatibility on Python 3. (#147)
- Fix a crash if the input BED has more than 4 columns.

reference:
- Add -g/--gender option to declare the chromosomal sex of the input sample(s) (same for all), instead of detecting/guessing for each sample. (#161)
- Ensure printed table of bad bins is a reasonable width. (#140)

segment:
- With a VCF (-v), don't output 'cn1' and 'cn2' columns; calculate the 'baf' column the same as in call. (#148)
- Improve memory efficiency somewhat when using a VCF. (#162)
- Fix possible 1-base overlap of output segments when using the cbs or flasso methods. Specifically, the start positions were erroneously all shifted 1 base to the left before. (#158)

scatter, heatmap: - Improve rendering of genomes much smaller than the human genome, e.g. yeast, by scaling telomere padding to the total genome size. The blank space at chromosome boundaries was set to a fixed number of basepairs, but is now calculated as 0.3% of the whole genome size (sum of chromosome lengths) -- which works out the same for the human genome. (#155)
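The genome-proportional padding is simple arithmetic, sketched here with integer math. The function name and the example chromosome sizes are illustrative assumptions; only the 0.3% figure comes from the note above.

```python
def telomere_padding(chrom_sizes):
    """Blank space at chromosome boundaries: 0.3% of the total genome size."""
    total = sum(chrom_sizes.values())
    return total * 3 // 1000


# Rounded, made-up sizes for a small genome with two chromosomes
small_genome = {"chrI": 230000, "chrII": 813000}
padding = telomere_padding(small_genome)
```

A fixed padding tuned for a genome of a few billion bases would dwarf chromosomes of this size, which is why scaling by total genome length helps small genomes.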

scatter: - Add option --segment-color. Now you can choose 'red' if you like.

metrics: - Input -s/--segments is now optional. If not given, compare bin log2 values to chromosome medians instead of segment means.

import-theta, export theta: - Drop sex chromosomes, since THetA2 doesn't handle them well. (#103, #153)

API

tabio:
- Read new formats: GFF (simply); UCSC genePred refFlat; sub-formats bed3, bed4
- Detect more formats with tabio.read_auto: BED, interval list, text coordinates (chr:start-end), refFlat, GFF, TSV with column names.
- Remove module ngfrills.regions, no longer needed.

GenomicArray:
- Moved to new sub-package 'genome'
- Rename method select to filter
- Rename method match_to_bins to into_ranges and generalize.
- New methods flatten, merge, resize_ranges, subdivide, subtract

In general, the 'genome' functionality can be reached by using the tabio sub-package to load a GenomicArray instance and use its methods directly:

```
from cnvlib import tabio

regions = tabio.read_auto(filename)
# Generate 500bp flanking regions
flanks = regions.resize_ranges(500).merge().subtract(regions)
```

- Python
Published by etal over 9 years ago

py-cnvkit - Version 0.8.1

This is primarily a bugfix release. The documentation is also improved, particularly covering the cnvlib API.

API: - For convenience in scripting, the relevant functions for running each CLI command (cnvlib.commands.do_*) are exported to the top level. For example: `import cnvlib; cnvlib.do_batch(...)`

Bug fixes:
- access: Avoid a type-validation error on Python 3. (#141)
- batch: Parallel processing now selects an appropriate number of workers for each step of the pipeline, reducing CPU contention when processing multiple samples in parallel. (#138)
- call: Apply the ci and sem filters before calculating b-allele frequencies and absolute copy number, as these filters can alter the final calls.
- reference: Safely handle an edge case in detecting gender from sample coverage depths when all bins have identical coverage depth, e.g. no coverage. (#144)
- segment: Fix handling and segmentation of SNV allele frequencies from a VCF. Ensure output column ordering is correct. Avoid a crash that could occur when SNV segmentation produces a segment that does not cover any coverage bins. (chapmanb/bcbio-nextgen#1590)
- cnvlib.tabio: Improve handling of empty files, including VCFs with no samples and/or no locus records. If records and samples are present but genotypes are missing or undetectable, scatter, call and export would previously reject all records when filtering for SNPs, but will now accept all records instead.

- Python
Published by etal over 9 years ago

py-cnvkit - Version 0.8

This is a larger release and the first update since our publication.

CNVkit now runs under Python 3 as well as 2.7. (#3, #101; thanks @mpschr)

File format changes:
- New "depth" column in .cnn, .cnr, .cns
- In .cns, "weight" is the sum, not mean, of bin-level weights within the segment

New script cnn_updater.py can be used to add the "depth" column to existing .cnn, .cnr and .cns files. However, most CNVkit commands should still work with pre-v0.8 files without using this script first. For best results, rebuild the .cnr and .cns for an ongoing study using the existing targetcoverage, antitargetcoverage and reference .cnn files.

Algorithmic changes:
- reference, gender, call, diagram, export: Gender, or chromosomal sex, is now inferred with a statistical test instead of a fixed threshold, significantly improving the inferences on noisy or aneuploid samples. (#116)
- reference, fix, call: Center log2 values by median of chromosome medians, by default. (#114)
- reference, metrics, segmetrics: Improve the calculation of biweight location and biweight midvariance (now in descriptives.py).

These deprecated components (since 0.7.x) have been removed:
- Commands rescale and loh -- use call and scatter, respectively, instead
- Some options in export bed and export theta -- use call first instead
- Script genome2access.py -- use cnvkit.py access instead

Updated commands:

batch:
- New option --method, with choices "hybrid" (default), "wgs", "amplicon", to simplify/streamline usage with whole-genome or amplicon sequencing protocols. See documentation for details; in short, "wgs" and "amplicon" do not use antitargets or the edge/density bias correction; "wgs" by default uses the sequencing-accessible genome as the targets, and uses a more stringent significance threshold for segmentation.
- Hide/deprecate --split option; it's always on now. To ensure bin coordinates do not change between batch runs (they generally won't anyway), use the -r/--reference option instead of specifying -t and -a in batch.
- Add --drop-low-coverage option, which is passed to segment internally.
- The -p/--processes option is also passed to coverage and segment internally (see below).

antitarget: - Increase the default average bin size from 100kb to 200kb.

coverage: - Parallelize coverage calculation over BED rows. The number of threads can be specified with the -p option. (#121; thanks @brentp)

segment: - Parallelize CBS and Haar segmentation methods across chromosomes. (#123, #125; thanks @brentp)

call:
- New --filter option, with choices 'cn', 'ampdel', 'ci', 'sem' implemented.
- With VCF b-allele frequencies (-v, 'baf'), always calculate the allele-specific integer copy numbers 'cn1' and 'cn2' so that 'cn1' is the larger one. BAF mirror direction stays majority-rules. (#105; thanks @mpschr)
- If b-allele frequencies are used and total copy number is zero, report allelic copy numbers as 0, not NaN.
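One way to picture the allele-specific copy numbers from the call notes above is the sketch below. The rounding formula is an assumption made for illustration and may not match CNVkit's exact calculation; it only reproduces the two stated properties, that cn1 is the larger allele and that a total copy number of zero yields 0 rather than NaN.

```python
def allelic_copy_numbers(baf, total_cn):
    """Split a total integer copy number into (cn1, cn2) with cn1 >= cn2."""
    if total_cn == 0:
        # Nothing to split: report 0 for both alleles instead of NaN
        return 0, 0
    # Mirror the BAF so the major-allele fraction is always >= 0.5
    major_fraction = max(baf, 1.0 - baf)
    cn1 = int(round(major_fraction * total_cn))
    cn2 = total_cn - cn1
    return cn1, cn2
```

For example, a segment with total copy number 2 and a BAF near 0 comes out as (2, 0), the pattern expected for copy-neutral loss of heterozygosity.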

scatter:
- Add --title option.
- Allow selecting & labeling gene(s) w/ only segments as input.

heatmap, scatter: - Allow saving plots in any image file format supported by matplotlib, not just PDF. The file format is determined by the output filename's extension, e.g. 'png' saves in PNG format -- making it easier to integrate CNVkit plots with HTML reports. (#120; thanks @chapmanb)

diagram: - Add -g/--gender option to specify sample's known gender.

gainloss: - Make output tables more consistent across options. Show individual gene names (rather than all genes grouped within a segment in 1 row); don't show rows with no gene name; report the segment probe count instead of number of probes within the gene; show any extra columns present in the input .cns file. (#107, #108; thanks @mpschr)

gender: - Show column headers and Y-chromosome log2 values in the output table.

segmetrics:
- Add stats options for mean, median, mode
- Add MSE, SEM stats as options

metrics, segmetrics: - Add --drop-low-coverage option (like in segment and gainloss)

Internals:
- New sub-package tabio: a more robust I/O framework unifying support for tabular formats, including CNVkit's .cnn/.cnr/.cns, BED, SEG, VCF, GATK/Picard interval list, and text coordinates (chr:start-end). Base class GenomicArray and its derived classes CopyNumArray and VariantArray do not implement their own I/O, but rather are instantiated via tabio. The "import-" commands use this as well.
- Removed rary.RegionArray; all functionality is now in tabio and GenomicArray.
- New module "descriptives.py" implements descriptive statistics on plain numpy arrays or pandas Series instances, independent of CNVkit.
- Better testing on Travis, covering Python 2.7, 3.4 and 3.5, on both Linux and OS X (thanks @kyleabeauchamp, @rmcgibbo, and @mpharrigan; #110)

Bug fixes:
- batch: Errors in parallel processes will immediately be raised as exceptions at the top level, rather than dying silently. Previously, no error would occur until a missing output file was needed later in the pipeline. (#55)
- segment:
  - Skip possible R warning text when parsing CBS output (#106) and run Rscript with the --vanilla option (#112; thanks @jsmedmar). Non-isolated R processes were prone to add various warning messages to the expected SEG output, which could crash the "segment" command for some users.
  - Handle zero-weight bins better (#128; thanks @chapmanb).
- scatter:
  - Handle selected segments with an empty gene name (#104; thanks @mpschr).
  - Don't crash on zero-length GenomicArray/CopyNumArray inputs.
- VCF parsing (now within tabio) improved:
  - More robust to missing genotype (GT) & depth (DP) fields (#102)
  - Handle VCFs from MuTect2 (#122)
- export theta: don't crash when SNP VCF is a single, unpaired sample, or if segmented input (.cns) is empty.
- heatmap: Avoid a possible crash if a sample is missing a chromosome.

Packaging: - Universal wheels are enabled for installation with pip (setup.cfg).

New & updated dependencies:
- futures
- futurize
- numpy raised to version 1.9
- pandas raised to version 0.18.1
- pysam version 0.9.1.1 is specifically excluded

- Python
Published by etal almost 10 years ago

py-cnvkit - Version 0.7.11

New dependency on pyfaidx, a Python library for handling samtools-style FASTA indexes (.fai).

export vcf: - Add CNVkit version and current date (i.e. local calendar date that the "cnvkit.py export vcf" command was run) to the VCF header.

export theta:
- Given a VCF of SNVs called jointly in paired tumor and normal samples, extract SNP allele counts to THetA2's custom input format ("snpformatted.txt"). The two additional files CNVkit generates this way can be used with THetA2's "--TUMOR_SNP" and "--NORMAL_SNP" options to improve estimates of tumor purity and clonality.
- Use CNVkit's segment weights and probe counts to estimate normal-sample read counts for each segment if no copy number reference profile (.cnn) or paired normal sample (.cnr) is given. The command's second argument is now optional and deprecated in favor of the -r/--reference option, which does the same thing.

import-theta: - Save integer copy number in the "cn" column of the output file(s) (CNVkit's .cns format).

call, export nexus-ogt: - When reading structural variants from a VCF file, interpret the END tag as the variant end position, not the length, per the VCF 4.2 specification. This bug could cause the b-allele frequencies calculated in call and export nexus-ogt to be erroneously repeated across many consecutive bins.
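The END-tag interpretation can be sketched with a small INFO-field parser. The function below is a hypothetical helper working on raw strings, not the parsing code CNVkit uses; it only shows END being read as a coordinate rather than a length.

```python
def sv_interval(pos, info):
    """Return (start, end) for a VCF record from its POS and INFO string.

    END is taken as the variant's end position; without an END tag the
    record is treated as covering the single position POS.
    """
    fields = dict(
        item.split("=", 1) for item in info.split(";") if "=" in item
    )
    end = int(fields.get("END", pos))
    return pos, end


start, end = sv_interval(1000, "SVTYPE=DEL;END=1500;SVLEN=-500")
```

Reading END=1500 as a length instead would stretch this deletion to position 2500, which is the kind of overextension that could repeat b-allele frequencies across many consecutive bins.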

scatter:
- When loading CNVkit files (in any command), identify and drop rows with "NaN" log2 values. (CNVkit never emits these, but they could happen if a user generates .cnr files from Illumina CGH array data files using a custom script.) The other rows (spread, gc, rmask) can be NaN without a problem, but plotting with scatter would crash when adjusting the y-axis based on NaN log2 values. (#95)
- Detect & warn if input .cnr/.cns/.vcf is not sorted by genomic coordinates. This could happen if the input VCF or manually constructed .cnr/.cns file (not generated by CNVkit) was not sorted by genomic coordinates. Then the error message was cryptic, because some bins/segments/SNVs were selected successfully but plotting crashed when laying out the x-axis coordinates.

Internals & packaging:
- Use the pyfaidx library to extract sequences from a genome FASTA file (used in the reference command), replacing some custom code in cnvlib. (#73; thanks @mdshw5)
- Documentation updates.

- Python
Published by etal about 10 years ago

py-cnvkit - Version 0.7.10

diagram: - Label genes even when given only segments (.cns). Plotting segments alone, without bin-level copy ratios (.cnr), can be convenient to produce an uncluttered PDF with a smaller file size while retaining most of the important CNV information. (#94)

scatter: - For calculating and plotting SNV b-allele frequencies, select the sample of interest from the given VCF based on the .cnr/.cns base filename, unless specified with --sample-id.

export nexus-ogt:
- Use normal-sample BAFs if normal-sample .cnr given. Previously, it would load tumor BAFs (taking the first tumor sample from the PEDIGREE tag) even if the properly-named .cnr file was for the normal sample in the VCF.
- Add --sample-id option to select VCF sample. Useful in case .cnr filename base doesn't match the sample IDs in the VCF header.
- Add filtering options --min-weight, --min-variant-depth.
  - The --min-variant-depth option works the same as in scatter -v, filtering SNVs by coverage depth (INFO field DP, usually) for the b-allele frequency calculation.
  - The --min-weight option allows the user to discard low-weight bins since Nexus Copy Number doesn't use CNVkit's weights for its own segmentation and could be misled by the noisier log2 ratios in less-reliable bins. For choosing the cutoff value, 0.5 is suitable in our experience, but check the distribution of weights in your own data first.

export vcf: - Add custom VCF "FORMAT" fields: FOLDCHANGE, FOLDCHANGE_LOG2, PROBES. (#91; thanks @pcingola)

segment: - The "flasso" method now works again; it was broken for a few releases. (#88; thanks @pcingola)

Packaging & internal:
- Add GRCh37 "access" BED file for users' convenience. The access command will also now raise an error if the chromosome names don't match between the "access" and "target" BED files.
- Work with the latest version of pysam (0.9). (#86)
- Silence some superfluous warnings from the latest version of pandas (0.18).
- Documentation updates, including more details on the call command.

- Python
Published by etal about 10 years ago

py-cnvkit - Version 0.7.9

Bug fixes, most importantly to work around an API change in pysam.

Installation: - Require pysam version earlier than 0.9 (#86)

fix, reference: - If the majority of target bins have no or very low coverage, warn the user about this, skip bias corrections, and mask out the low-coverage target bins during centering to ensure the output is still vaguely usable and sane. This issue could occur because the wrong target BED was used initially, or maybe hybridization failed in library prep.

reference: - Ensure the output table's columns are ordered correctly. In some cases it was possible for the output table's columns to be ordered differently, which still works in CNVkit, but is weird.

call, rescale, export: - Check specified gender more sensibly; on failure, default to female. Specifically, use case-insensitive string comparison to test whether the given argument means "male". Treating chrX as having neutral ploidy is probably a less surprising fallback, especially if the "-y" flag is forgotten elsewhere in the pipeline.
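The case-insensitive check with a female fallback might look like this sketch. The function name and the accepted spellings ("m", "male") are assumptions for illustration, not the exact values CNVkit tests for.

```python
def is_male(value):
    """Return True only if the argument clearly means "male".

    Anything else, including None or an unrecognized string, falls back
    to female (False), so chrX is treated as having neutral ploidy.
    """
    if value is None:
        return False
    return str(value).strip().lower() in ("m", "male")
```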

- Python
Published by etal over 10 years ago

py-cnvkit - Version 0.7.8

New features in the call command make it more amenable to analyzing tumor heterogeneity, and also make the rescale command redundant. Documentation is updated with more methodological background info.

call:
- Put absolute copy number in a new "cn" column. When rescaling log2 ratios for purity, do not round to integer absolute copy number values. (#83)
- New -v/--vcf option: Calculate b-allele frequency (BAF) average for each segment and output as a new column "baf". Rescale BAFs if --purity is specified. Then, using BAF and total copy number (CN, the "cn" column), assign major and minor allele copy number to each segment and output as new columns "cn1" and "cn2". These values can indicate allelic imbalance, including loss of heterozygosity (LOH). (#84)
- New --center option that works the same as in rescale.
- New method -m none to perform any specified transformations (rescaling, re-centering, adding b-allele frequencies), but do not call integer copy numbers.

rescale: - Deprecated in favor of call with the -m none option, which does the same thing. - If recentering is specified with --center, do it before, not after, rescaling log2 values for tumor sample purity.

export bed, export vcf: - Take absolute copy number from "cn" column if present (#83)

antitarget: - Whitelist chromosomes X and Y along with integer chromosome names for inclusion as canonical mammalian chromosomes. Keep the fallback to "short" chromosome names if no such canonical chromosome names are detected. (#37)

reference: - Expose bias corrections (GC, RepeatMasker, targeting density) as command-line options --no-gc, --no-rmask, and --no-edge, similar to the fix command. (#80)

Internal: - VariantArray.readvcf: somatic mask was the opposite of what it should have been, i.e. skipsomatic was skipping germline and retaining only somatic SNVs.

- Python
Published by etal over 10 years ago

py-cnvkit - Version 0.7.7

Small improvements, bugfixes, and documentation updates.

fix: - Removed the hard filter on RepeatMasker fraction of antitarget bins. This filter doesn't appear to improve calling on current benchmarks. - Drop bins that have very high coverage in the reference, in addition to the low-coverage bins already dropped (normalized log2 values outside +/- 5). - Ignore very-low-coverage bins when recentering (by default). For good-quality samples this doesn't make much difference, but it's safer and seems to improve the centering slightly on lower-quality samples. - Ensure antitarget bin weights are not set to 0 if the majority of target bins have no coverage -- this would cause segmentation to fail. (#82) - Don't crash if antitargets are empty (to support WGS and targeted amplicon capture), fixing a regression.

antitarget: - Keep untargeted contigs that appear to be "canonical" chromosomes. Prefer chromosomes with numeric names (autosomes in most mammalian reference genomes); but if none of the targeted chromosomes have numeric names, then fall back to chromosomes with names no longer than the longest-named targeted chromosome. (#37)
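The selection rule for untargeted contigs can be sketched like this. The function name and the handling of a "chr" prefix are assumptions; only the prefer-numeric-then-fall-back-to-length logic comes from the note above.

```python
def canonical_untargeted(all_chroms, targeted_chroms):
    """Pick untargeted contigs that look like canonical chromosomes."""

    def is_numeric(name):
        # Assumption: a leading "chr" is ignored when testing for a numeric name.
        core = name[3:] if name.startswith("chr") else name
        return core.isdigit()

    targeted = set(targeted_chroms)
    untargeted = [c for c in all_chroms if c not in targeted]
    if any(is_numeric(c) for c in targeted):
        # Numeric names exist among the targets: keep numeric contigs only.
        return [c for c in untargeted if is_numeric(c)]
    # Fallback: keep names no longer than the longest targeted name.
    max_len = max((len(c) for c in targeted), default=0)
    return [c for c in untargeted if len(c) <= max_len]
```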

batch: - Disallow input BAMs with duplicate base filenames (#81). Now it will trigger an error instead of overwriting some output files.

segment: - The --drop-outliers option now masks outliers according to multiples (default 10x) of the 95th percentile, not the 90th percentile. This performs better in benchmarking.
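A rough sketch of percentile-based outlier masking follows. The note above does not say which quantity the percentile is taken over; this sketch assumes it is each bin's absolute deviation from a rolling median, and all function names and the window width are hypothetical.

```python
import math
import statistics


def rolling_median(values, width=5):
    """Median of a centered window around each position (shorter at the ends)."""
    half = width // 2
    return [
        statistics.median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def outlier_mask(log2_values, factor=10, pct=95, width=5):
    """Flag bins whose deviation exceeds `factor` times the `pct`th percentile."""
    smoothed = rolling_median(log2_values, width)
    deviations = [abs(x - s) for x, s in zip(log2_values, smoothed)]
    cutoff = factor * nearest_rank_percentile(deviations, pct)
    return [d > cutoff for d in deviations]
```

Bins flagged True would be left out of segmentation; a larger `factor` masks fewer bins.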

Plots scatter, heatmap: - With the "-c/--chromosome" option, handle unbounded ranges (e.g. "chr1:100-" or "chr5:-100000") treating the missing start/end of the range as the start/end of the specified chromosome.
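Parsing such ranges might look like the sketch below. The function name is hypothetical, and returning None for a missing bound (to be resolved to the chromosome's start or end by the caller) is a design choice made here, not necessarily what CNVkit does internally.

```python
def parse_range(text):
    """Parse 'chr1:100-200', 'chr1:100-', 'chr5:-100000' or plain 'chr1'.

    Returns (chromosome, start, end); a missing bound is returned as None,
    meaning "extend to that end of the chromosome".
    """
    chrom, sep, span = text.partition(":")
    if not sep or not span:
        return chrom, None, None
    start_text, _, end_text = span.partition("-")
    start = int(start_text) if start_text else None
    end = int(end_text) if end_text else None
    return chrom, start, end
```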

heatmap: - A more efficient implementation. Now, plotting a heatmap of .cnr is feasible, and behavior is a bit more consistent (e.g. placement of rectangles is more accurate; plotting a selection where only some samples have data will still show all samples). - Don't crash if selection overlaps no segments, e.g. if the selection is a centromeric or telomeric region. Previously it would crash with an obscure error.

Misc. bugfixes: - batch: log # parallel processes correctly for "-p 0" - import-theta: fix crash; namedtuples are immutable (#77) - metrics: require --segments (closes #79) - rescale: fix crash if --purity is not specified - VariantArray: Fix VCF parsing if filters are not used.

- Python
Published by etal over 10 years ago

py-cnvkit - Version 0.7.6

Minor bugfixes and improvements.

scatter: - Tweaked plot colors for better visibility and accessibility: points are slightly darker, and segments are now a deep gold color instead of red.

fix: - Downweight targets or antitargets proportionally to their relative variability of bin log2 values; i.e. if targets are twice as variable (by interquartile range of bin log2 values) as antitargets, divide all target bin weights by 2. This happens after all bias corrections and reference normalization, and appears to improve the final segmentation results.
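The downweighting rule can be illustrated with a small sketch. Function names are hypothetical, and the quartile method (the default of statistics.quantiles) is an assumption.

```python
import statistics


def interquartile_range(values):
    """IQR using the default (exclusive) quartile method of the stdlib."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def downweight_noisier(target_log2, antitarget_log2,
                       target_weights, antitarget_weights):
    """Divide the weights of the noisier bin type by the ratio of IQRs."""
    t_iqr = interquartile_range(target_log2)
    a_iqr = interquartile_range(antitarget_log2)
    if t_iqr > a_iqr > 0:
        ratio = t_iqr / a_iqr
        target_weights = [w / ratio for w in target_weights]
    elif a_iqr > t_iqr > 0:
        ratio = a_iqr / t_iqr
        antitarget_weights = [w / ratio for w in antitarget_weights]
    return target_weights, antitarget_weights
```

If targets have twice the IQR of antitargets, every target weight is halved and antitarget weights are left alone.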

antitarget: - Don't emit antitargets for untargeted chromosomes with long names, e.g. "chr6apdhap1" -- these are presumably alternative/unassigned contigs, not real canonical chromosomes that deserve to be included for CNV calling. But do continue to keep untargeted chromosomes with names up to the length of the longest-named targeted chromosome. (Improves on #37) - Indicate default --min-size in the help message.

batch: - Log the number of parallel processes correctly when "-p 0" is used to automatically detect the number of CPUs -- previously, this option would print on the console that samples were being run in serial, but then launch multiple parallel processes.

segment: - Change the --drop-outliers default value from 5 to 10, based on performance in benchmarking.

Internally: - Fixed detection of autosomes to be used for re-centering bin log2 values and detecting gender. - Fixed parsing the GATK/Picard "interval list" file format - strand and name were swapped.

- Python
Published by etal over 10 years ago

py-cnvkit - Version 0.7.5

Global speedups, friendlier error handling and miscellaneous bug fixes. Documentation updates (thanks @kyleabeauchamp; #67). Expanded unit tests & restored continuous integration (TravisCI). Raised the minimum pandas version to 0.17.1, the latest.

rescale (new command; #64): - Adjust .cnr or .cns files for normal contamination or subclone fraction. - Re-center log2 values by median (the usual), mode, mean, or biweight location.

segment: - Detect outlier bins and ignore them during segmentation using a method similar to BIC-seq. Command line option: --drop-outliers; any outlier bins found will be logged.

coverage: - If the given target BED file is missing the 4th column (gene names), fill in the dummy name "-" instead of crashing.

segmetrics: - Expose alpha and number of bootstraps as command-line options -a/--alpha and -b/--bootstrap for calculating confidence intervals.

antitarget: - Reduce default bin size from 150kb to 100kb.

fix: - Speed improvements: now about 20 times faster on exomes.

API changes: - Gene names to treat as meaningless and to ignore in reporting (by default "-", ".", "CGH") can be globally configured in cnvlib/params.py (params.IGNOREGENENAMES). - vary.VariantArray (used in scatter) can now parse VCF files with no samples (genotypes) as a table of plain loci.

- Python
Published by etal over 10 years ago

py-cnvkit - Version 0.7.4

This is primarily a bugfix release.

export: - bed --show variant now filters CNAs on sex chromosomes correctly, taking reference and sample genders into account. - nexus-ogt format now emits BAFs more similar to the original VCF allele frequencies. Previously, if multiple SNVs fell into a single CNVkit genomic bin, the allele frequencies of those SNVs would all be "mirrored" above 0.5 before taking the median. Now the SNVs are mirrored in the direction of the majority of the SNVs in the bin, whether above or below 0.5, so that the output looks more balanced and low-frequency SNVs are more apparent.
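The majority-direction mirroring can be sketched as below. The function name and the tie-breaking rule (mirror upward when the counts are equal) are assumptions.

```python
import statistics


def bin_baf(allele_freqs):
    """Median allele frequency of one bin, mirrored toward the majority side."""
    if not allele_freqs:
        return None
    above = sum(1 for f in allele_freqs if f > 0.5)
    below = sum(1 for f in allele_freqs if f < 0.5)
    if above >= below:
        # Most SNVs sit above 0.5: reflect the rest upward.
        mirrored = [max(f, 1.0 - f) for f in allele_freqs]
    else:
        # Most SNVs sit below 0.5: reflect the rest downward.
        mirrored = [min(f, 1.0 - f) for f in allele_freqs]
    return statistics.median(mirrored)
```

A bin holding frequencies 0.3, 0.35 and 0.7 therefore reports about 0.3 rather than 0.7, so low-frequency SNVs stay visible in the output.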

heatmap: - Sub-chromosomal regions can now be selected for display with the -c option, e.g. -c chr7:125000000-145000000, just like the same option in scatter.

segment: - Fix the listing of gene names in each segment in the output .cns file. Previously, briefly, each gene's name was truncated to 1 character.

- Python
Published by etal over 10 years ago

py-cnvkit - Version 0.7.3

access: - New command equivalent to the now-deprecated genome2access.py script.

target, antitarget: - Always write output files in 4-column BED format.

scatter: - Copy ratios (.cnr) are no longer required. Without this input file, behavior is similar to the now-deprecated loh command, but still more flexible. - VCF input file can include multiple tumor samples and PEDIGREE tags; if a tumor sample ID is specified, all PEDIGREE tags will be checked to find the matching normal sample. - VCFs processed by CLC Genomics Server are now parsed correctly.

loh: - Deprecated. Use scatter with -v and no .cnr file instead.

segment: - Preliminary support for segmenting SNP allele frequencies from a VCF in addition to total copy number (-v option). Details are likely to change in a later release. (#34) - In the weight column of the output file, values are now the sum, not the mean, of the weights of the probes covered by that segment. - The haar segmentation method is improved to avoid duplicate breakpoints and run much faster.

export bed: - Deprecate --show-all in favor of --show with possible arguments all (like --show-all), ploidy (default behavior), or variant (show the same regions as export vcf).

export vcf: - Fix a typo in the SVLEN tag definition in the VCF header -- Number should be 1, not -1 which caused GATK parsing to fail. (#57; thanks @chapmanb)

Python library cnvlib: - Logging is now done with the Python standard library's logging module, making it easier to silence or redirect status messages. In particular, unit tests run more quietly. (#52) - Internal refactoring (including new features in GenomicArray, RegionArray, VariantArray) resulting in changes to the cnvlib API, as well as some performance improvements.

- Python
Published by etal over 10 years ago

py-cnvkit - Version 0.7.2

A variety of mostly minor improvements and bug fixes over v0.7.1.

segment, gainloss, segmetrics: - Don't exclude very-low-coverage bins from calculations by default; instead, expose this option as --drop-low-coverage. (This option usually helps on tumor samples with some normal contamination, but leads to problems on germline samples with homozygous deletions.)

segment: - Output .cns files now have a "weight" column which is the mean of the weights of the bins it covers. - Output of the 'haar' segmentation method now has each segment's gene names listed, as with the other methods. - Fixed a bug where every segment's probe count (the "probes" column) could be overwritten with the _ character. (#53; thanks @chapmanb)

segmetrics: - Each statistic is now printed in its own column, instead of squeezing all stats into the "gene" column. The confidence/prediction interval stats get two columns, _lo and _hi (lower and upper bound).

loh, scatter: - Given a VCF called on a tumor-normal pair, use the paired normal to select appropriate germline SNPs for plotting.

export: - New format "nexus-ogt" combines bin-level copy number ratios with b-allele frequencies given a VCF and a .cnr file. This replaces "nexus-basic" with the -v option that was introduced in v0.7.1; "nexus-ogt" stores the same info but can be viewed in BioDiscovery Nexus Copy Number without any special configuration (load it as the "Custom-OGT" data format). - Renamed bed option --show-neutral to --show-all. - vcf option -g/--gender now works properly for identifying CNVs on sex chromosomes.

call: - Fixed the threshold method to calculate absolute copy number on sex chromosomes correctly. (#49; thanks @tskir)

- Python
Published by etal over 10 years ago

py-cnvkit - Version 0.7.1

This is primarily a bugfix release. Many more unit test cases were added to the automated test suite. Code coverage is now monitored at Codecov (thanks @stevepeak).

export nexus-basic: - New optional argument -v/--vcf extracts SNV b-allele frequencies from the given VCF file, matches them to the bins in the .cnr file, and prints an additional "baf" column in the output table. These allele frequencies can then be viewed in Nexus Copy Number, similar to a SNP array.

call: - Fixed a bug in the threshold method where the copy number of haploid chromosomes was twice what it should be. The clonal method already handled these chromosomes properly. (#49)

reference: - Handle blank/empty antitarget BED and coverage (.cnn) files. This was a regression from earlier releases in v0.7.0. (#51) - When calculating GC and RepeatMasker values, catch invalid BED ranges that extend beyond the length of the chromosome and raise an informative error. This would error before, too (in ngfrills.faidx), but the message would be baffling.

fix: - Catch duplicated target ranges, e.g. the exact same bait labeled with two different gene names, and report those ranges in the error message. The target command's --split option should usually fix these, but sometimes it's not used.

- Python
Published by etal over 10 years ago

py-cnvkit - Version 0.7.0

CNVkit now depends on pandas, SciPy, and PyVCF. The internals were largely rewritten, so please report any bugs or other regressions you find.

Documentation is much improved.

export: - VCF format is supported (#5, #41). The generated VCFs are compatible with many third-party tools, including development versions of MetaSV. (Thanks @chapmanb) - Removed the "freebayes" sub-command; use "export bed" instead.

segment: - The names of genes (or other targeted loci) covered by each segment are now included in the output .cns file. - The p-value or q-value threshold (depending on the method) can now be specified with -t/--threshold. - The "haar" method works properly now (#6). This segmentation algorithm is implemented in Python and does not require R to run. It is a bit faster than CBS, but not as accurate.

loh: - Plot variant allele frequencies (VAFs) as their actual values, 0 to 1, instead of the mirrored b-allele frequency (0.5 to 1). Draw segment mean allele frequencies separately above and below 0.5. This matches how the equivalent SNP array data are typically viewed.

antitarget: - Generate off-target bins for all chromosomes present in the "access" BED file, not just those where targeted regions occur. (#37)

coverage: - A minimum read mapping quality (MAPQ) value can now be specified with -q/--min-mapq. The default value is 0, i.e. reads are no longer excluded for low MAPQ or ambiguous mapping location. This should generally improve calling accuracy and avoid some spurious deletion calls.

- Python
Published by etal almost 11 years ago

py-cnvkit - Version 0.6.1

Small fixes in segmentation, affecting the output of segment and preventing crashes in segmetrics: - Exclude fewer low-coverage bins from segmentation (using a lower minimum coverage threshold). - In case the first or last bins on a chromosome were excluded from segmentation, adjust the first and last segments on each chromosome so that their endpoints match the first and last bins. - If no bins on a chromosome passed the coverage filter, instead of omitting the chromosome from segmentation output, generate a single segment covering the full chromosome, with segment log2 ratio 0.0. (So, all chromosomes in the .cnr file will be present in the .cns file, too.)

- Python
Published by etal almost 11 years ago

py-cnvkit - Version 0.6.0

Added two new commands, call and segmetrics, and a new export format, BED.

segmetrics: - Calculates summary statistics of the residual bin-level log2 ratio estimates from the segment means, similar to the existing metrics command, but for each segment individually. Results are output in the same format as the CNVkit segmentation file (.cns), with the stat names and calculated values printed in the "gene" column. - Supported stats: - standard deviation, median absolute deviation, inter-quartile range, Tukey's biweight midvariance (as in metrics); - confidence interval, estimated by bootstrap; - prediction interval, estimated by the range between the 2.5-97.5 percentiles of bin-level log2 ratio values within the segment. - Thanks to @mjafin for suggesting this feature (#28).
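The two interval statistics can be illustrated roughly as follows. This is a sketch under assumptions: the bootstrap is taken over the segment mean, the percentile uses linear interpolation, and the function names, default number of resamples and seed are all hypothetical.

```python
import random
import statistics


def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def prediction_interval(bin_log2, alpha=0.05):
    """Range between the 2.5th and 97.5th percentiles of the bins (alpha=0.05)."""
    return (percentile(bin_log2, 100 * alpha / 2),
            percentile(bin_log2, 100 * (1 - alpha / 2)))


def bootstrap_ci(bin_log2, n_boot=100, alpha=0.05, seed=0):
    """Confidence interval of the segment mean from resampling with replacement."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(bin_log2, k=len(bin_log2)))
        for _ in range(n_boot)
    ]
    return (percentile(means, 100 * alpha / 2),
            percentile(means, 100 * (1 - alpha / 2)))
```

The prediction interval describes the spread of individual bins, while the bootstrap interval describes the uncertainty of the segment's mean, so the latter is normally much narrower.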

call: - Given segmented log2 ratio estimates (.cns file), round the copy ratio estimates to integer values using either: - A list of threshold log2 values for each copy number state, or - Some algebra, given known tumor cell fraction and normal ploidy. (This was previously available through the export freebayes command, see below.) - The output is another .cns file, where the values in the log2 column are still log2-transformed, but represent integers in log2 scale -- e.g. a neutral diploid state is represented as "0.0", not the integer 2. These output files are still compatible with the other CNVkit commands that accept .cns files, and can be plotted the same way.
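Both rounding approaches can be sketched in a few lines. The threshold values below are illustrative only and are not claimed to be CNVkit's defaults; the function names are hypothetical. The algebra assumes the observed signal is a mixture of tumor cells (fraction `purity`) and normal cells carrying `ploidy` copies.

```python
import bisect


def call_by_thresholds(log2, thresholds=(-1.1, -0.4, 0.3, 0.7)):
    """Copy number = count of thresholds lying strictly below the log2 value.

    With four thresholds this yields copy numbers 0 through 4.
    """
    return bisect.bisect_left(thresholds, log2)


def call_clonal(log2, purity, ploidy=2):
    """Solve observed = purity * tumor + (1 - purity) * ploidy for tumor copies."""
    observed_copies = ploidy * 2.0 ** log2
    tumor_copies = (observed_copies - (1.0 - purity) * ploidy) / purity
    # Noise can push the estimate below zero; clamp it.
    return max(0, int(round(tumor_copies)))
```

For example, at 50% purity a log2 ratio of about -0.415 (a copy ratio of 0.75) corresponds to a single-copy loss in the tumor cells, whereas the same value in a pure sample would round to 2 copies.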

export bed: - New bed format supporting the same features as export freebayes that were not moved into the call command (see above). The output BED file is still compatible with the FreeBayes --cnv-map option. In addition, export bed has the new option --show-neutral to also output neutral-CN segments/regions, in addition to the CNV regions output by default. - The export freebayes sub-command is deprecated but still available in this release; it will be removed in the next release. This command supported the tumor-purity adjustment now implemented in the call command. The recommended approach is to instead run call first on each .cns file, and then export bed on all the adjusted .cns files to get an equivalent BED file compatible with FreeBayes --cnv-map option.

Smaller changes: - gainloss: Reduced the default log2 ratio threshold from .5 to .2 - import-picard: Use the un-normalized mean coverage instead of the normalized coverage of each target as the log2 coverage values in the output .cnn file. This matches the output of the coverage command; CNVkit normalizes coverages later in the pipeline. - Some internal refactoring. Please report any bugs, real or perceived, on our GitHub issue tracker.

- Python
Published by etal almost 11 years ago

py-cnvkit - Version 0.5.1

Bug fixes for two edge cases in whole genome analyses (thanks @chapmanb): - reference: Merging target and antitarget .cnn files where antitargets are empty - diagram: Avoid trying to plot segments over the start or end of chromosomes

- Python
Published by etal about 11 years ago

py-cnvkit - Version 0.5.0

This release includes a variety of improvements to CNVkit's calling accuracy and robustness. All CNVkit files built with previous versions will continue to work with this version, but for best results, I recommend rebuilding your reference.cnn file(s) from the targetcoverage.cnn and antitargetcoverage.cnn files.

coverage: - Output target/antitarget coverage (.cnn) files are no longer median-centered. Read depths in each bin are still log2-scaled, but the observed read depth can now be easily recovered from .cnn files.

reference, fix: - Include a "flat pseudocount" in addition to the given normals, making paired tumor-normal calling much more robust and accurate. - Perform bias corrections on the input normal samples before calculating the average and spread of log2 values.

fix: - Do bias corrections before subtracting the reference, instead of after, because the reference already includes bias corrections now. - In addition to weighting bins by spread (which can only be observed with a pooled reference), also weight by bin size and deviation of reference log2 values in each bin from the global median. So, useful bin weights are now derived from "flat" and single-normal-sample references, too.

segment: - Recalculate CBS segment means using bin weights (in the R library this is simply the mean, arguably a bug). - Set CBS segment start/end positions to match the underlying bin start/end positions. - Improved centromere detection -- only exclude one "large gap", if any, from each chromosome. - Tuned CBS calling parameters to improve accuracy (see benchmarks in the repo etal/cnvkit-examples).

diagram: - Label genes using the same criteria as the gainloss command: if segments are given, use the segment value at each gene, otherwise calculate the weighted average of bin-level log2 values within each gene. - New option -m/--min-probes to match gainloss. - Guess gender from chrX more reliably, so that the same gender is called from the bin-level (.cnr) and segmented (.cns) values given.

scatter, loh: - When plotting allele frequencies from a VCF, if segments are given (.cns), also apply those segments to allele frequencies to show LOH regions that match CNVs. - Skip somatic variants identified in a VCF, and try to retain only germline variants, when plotting LOH. (This is not very well standardized across callers, so please watch for bad behavior from callers other than FreeBayes and MuTect, and let me know about it!) - scatter only: Added options --y-min, --y-max to set y-axis limits on the plot. - Removed the deprecated -r option. Use -c instead.

The long-deprecated cbs command has been removed. Use segment instead.

Bugs in parsing and writing empty and 1-line VCF, BED and CNVkit files, and other VCF quirks, have now been fixed (Thanks @chapmanb!)

- Python
Published by etal about 11 years ago

py-cnvkit - Version 0.4.1

New features: - scatter command: Option -c can now take coordinate ranges like -r, so -r is deprecated and will be removed in the next release. - genome2access.py script: New -x option to exclude additional regions. Added a new file "data/access-5k-mappable.hg19.bed" which used this option to exclude the Encode "Duke" and "Dac" low-mappability regions.

Also: - Improved the help/usage messages for several commands. Added a "version" command that prints the current CNVkit version. (Thanks @HenrikBengtsson) - Tuned CBS calling parameters to improve segmentation accuracy according to some benchmarks. - Sped up a few slow functions identified by profiling. In particular, metrics is much faster now. - Fixed bugs/incompatibilities in plotting commands and cleaned up the source code (Thanks @chapmanb and @roryk)

CNVkit can now be obtained and run as a Docker container: https://registry.hub.docker.com/u/etal/cnvkit/

- Python
Published by etal about 11 years ago

py-cnvkit - Version 0.4.0

New features: - Plotting (scatter and loh commands): - Support VCFs from more callers, including MuTect, VarScan and FreeBayes. Support multi-sample VCFs; the sample in the VCF can be selected by name with the -i option, and will also be shown as the plot title. Thanks to Brad Chapman (@chapmanb) for this contribution. (#11) - Enable highlighting of selected regions other than genes using the -r and -w options. The plot title (sample ID) can also be specified with -i/--sample-id. Thanks to Brad Chapman (@chapmanb) for this contribution. (#9) - New -l/--range-list option to plot a BED file of regions, each in its own plot, and combine the generated plots into a single multi-page PDF file. Thanks to Rory Kirchner (@roryk) for this contribution. (#21) - FreeBayes export format can now handle multiple samples (.cns files).

Changes: - Renamed --male-normal option to --male-reference (but kept -y alias) in all commands that had it. - export options: Specify sample name with -i/--sample-id option instead of -n. - scatter plotting command: added --min-variant-depth option to match loh. (#10) - The loh plot command does not attempt significance testing anymore; we're working on a better solution. (#10, #18)

Bug fixes: - Handle empty BED/region/interval_list files, so that an empty "antitarget" file can be used when analyzing WGS or targeted amplicon capture datasets. (#19) - Ignore "." labels for genes, the same way we already ignore "-" labels, for better interoperability with BEDtools. Thanks to Brad Chapman (@chapmanb) for this contribution. (#12) - Accept "sample.bai" as index for "sample.bam". (#8) - SEG import: The option --from-log10 now works to convert log10 ratio values to log2 scale.
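The log10-to-log2 conversion behind --from-log10 is a change of logarithm base, log2(x) = log10(x) / log10(2). A minimal sketch (the function name is hypothetical):

```python
import math


def log10_to_log2(log10_ratio):
    """Convert a log10 copy ratio to log2 scale (multiplies by about 3.32)."""
    return log10_ratio / math.log10(2)
```

A log10 ratio of about 0.301 (a doubling) thus becomes a log2 ratio of 1.0.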

Documentation has also improved substantially, including the installation instructions. The built-in help text for each command now shows default values for each option, where applicable.

- Python
Published by etal about 11 years ago

py-cnvkit - v0.3.3

Aesthetic improvements to plots
Fixed an edge case where a very small BAM file could mistakenly appear to be unsorted.

- Python
Published by etal over 11 years ago

py-cnvkit -

Enable batch to be run without specifying tumor samples, in order to only create a reference.
Copy ratios are now re-centered at the smoothed "mode" (peak density) rather than median, for better behavior on samples with many large-scale losses.
Minor fixes and improvements to several safety checks in response to feedback from users.

- Python
Published by etal over 11 years ago

py-cnvkit -

This is a bugfix release. - coverage: Allow spaces in gene names. - target: If gene names are not provided (BED3 format), use a default gene name "-". (Previously, the generated targets BED file would cause coverage to crash later.) - Fixed the math underlying the edge/density correction.

- Python
Published by etal over 11 years ago

py-cnvkit -

antitarget: Choose an appropriate default minimum bin size based on the user-provided average bin size.
breaks: Added an option --min-probes, which applies to both sides of the detected breakpoint within a gene.
export: Fixed the nexus-basic format to work with Biodiscovery Nexus Copy Number again.
target, refFlat2bed.py: Handle overlapping gene annotations a little better.
Made the dependency on SciPy "soft", only triggered when using the segmentation algorithm haar.

- Python
Published by etal over 11 years ago

py-cnvkit - v0.2.1

Updated dependency versions of Python packages numpy and matplotlib to match Ubuntu 12.04 package versions. This fixes a confusing installation issue related to having an old version of Biopython already installed.

- Python
Published by etal over 11 years ago
