ngcgh

Tools for producing pseudo-cgh of next-generation sequencing data

https://github.com/seandavi/ngcgh

Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
○
codemeta.json file
○
.zenodo.json file
✓
DOI references
Found 3 DOI reference(s) in README
✓
Academic publication links
Links to: zenodo.org
✓
Committers with academic emails
1 of 4 committers (25.0%) from academic institutions
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.7%) to scientific vocabulary

Keywords

bioinformatics cancer-genomics genomics python sequencing

Last synced: 6 months ago · JSON representation

Repository

Tools for producing pseudo-cgh of next-generation sequencing data

Basic Info

Host: GitHub
Owner: seandavi
Language: Python
Default Branch: master
Homepage:
Size: 54.7 KB

Statistics

Stars: 18
Watchers: 5
Forks: 9
Open Issues: 5
Releases: 0

Topics

bioinformatics cancer-genomics genomics python sequencing

Created almost 15 years ago · Last pushed over 9 years ago

Metadata Files

Readme

README.rst

.. image:: https://zenodo.org/badge/5710/seandavi/ngCGH.png
   :target: http://dx.doi.org/10.5281/zenodo.11391

Overview
============
Next-generation sequencing of tumor/normal pairs provides a good opportunity to examine large-scale copy number variation in the tumor relative to the normal sample.  In practice, this concept seems to extend even to exome-capture sequencing of pairs of tumor and normal.  This library consists of a single script, ngCGH, that computes a pseudo-CGH using simple coverage counting on the tumor relative to the normal.

I have chosen to use a fixed number of reads in the normal sample as the "windowing" approach.  This has the advantage of producing copy number estimates that should have similar variance at each location.  The algorithm will adaptively deal with inhomogeneities across the genome such as those associated with exome-capture technologies (to the extent that the capture was similar in both tumor and normal).  The disadvantage is that the pseudo-probes will be at different locations for every "normal control" sample. 
 

Installation
=============
There are several possible ways to install ngCGH.  

github
-------
If you are a git user, then simply cloning the repository will get you the latest code.

::

  git clone git://github.com/seandavi/ngCGH.git

Alternatively, click the `Download `_ button and get the tarball or zip file.

In either case, change into the resulting directory and::

  cd ngCGH
  python setup.py install

From PyPi
-------------------
If you have easy_install in place, this should suffice for installation:

::

  easy_install ngCGH




Usage
=====
Usage is very simple:

::

    usage: ngCGH [-h] [-w WINDOWSIZE] [-o OUTFILE] [-l LOGLEVEL] [-r REGIONS]
		 [-t PROCESSES]
		 normalbam tumorbam

    positional arguments:
      normalbam             The name of the bamfile for the normal comparison
      tumorbam              The name of the tumor sample bamfile

    optional arguments:
      -h, --help            show this help message and exit
      -w WINDOWSIZE, --windowsize WINDOWSIZE
			    The number of reads captured from the normal sample
			    for calculation of copy number (default: 1000)
      -o OUTFILE, --outfile OUTFILE
			    Output filename, default  (default: None)
      -l LOGLEVEL, --loglevel LOGLEVEL
			    Logging Level, 1-30 with 1 being maximal logging and
			    30 being errors only [20] (default: 20)
      -r REGIONS, --regions REGIONS
			    regions to which analysis should be restricted, either
			    a bed file name or a single region in format chrN:XXX-
			    YYY (default: None)
      -t PROCESSES, --threads PROCESSES
			    parallelize over regions (or chromosomes) (default: 1)


Output
======
The output format is also very simple:

::

  chr1    4851    52735   1000    854     -0.025120
  chr1    52736   59251   1000    812     -0.097876
  chr1    59251   119119  1000    876     0.011575
  chr1    119120  707038  1000    1087    0.322924
  chr1    707040  711128  1000    1016    0.225472
  chr1    711128  711375  1000    1059    0.285275
  chr1    711375  735366  1000    919     0.080709
  chr1    735368  798455  1000    972     0.161600

Columns 1-3 describe the chromosome, start, and end for each pseudo-probe.  The fourth column is the number of reads in the normal sample in the window while the fifth column represents the reads *in the same genomic window* from the tumor.  The last column contains the median-centered log2 ratio between tumor and normal.


Convert from ngCGH to BioDiscovery Nexus
----------------------------------------
Included in the release is a script, convert2nexus, that takes as input the filename of a file created by ngCGH and converts it into a file that the Nexus CGH software from BioDiscovery can load for further analysis.  The format looks like this:

::

  Name    Chromosome      Start   End     PALZGU.cgh
  chr1_10004      chr1    10004   15735   -2.087921
  chr1_15736      chr1    15736   69385   -2.670936
  chr1_69386      chr1    69386   521687  -0.428244
  chr1_523537     chr1    523537  726959  0.080269
  chr1_726959     chr1    726959  808542  0.223047
  chr1_808546     chr1    808546  809138  -1.186761

I presented a webinar on using ngCGH with `BioDiscovery Nexus `_ that you can `view here `_.

.. note::

   The file format generated above can be loaded into Biodiscovery Nexus using the "Multi1" data type.


Convert from Complete Genomics to BioDiscovery Nexus
----------------------------------------------------
There is now plenty of Complete Genomics data floating around.  We are often interested in visualizing the somatic CNV data in Biodiscovery nexus.  There is a script, cgi2nexus that takes a file typically named as "SomaticCnvDetailsDiploidBeta*" and converts to the file format noted above.  Bzip2 (typical from CGI) are uncompressed on-the-fly.

Segmenting output
-------------------------
The cgh2seg script uses some sane defaults (at least for exomes) to the Circular Binary Segmentation algorithm as implemented in the DNAcopy Bioconductor package.  The segmented results are centered around the mode of the density of the segmented values on a per-probe basis.  The script will write the "Centrality parameter" to stderr when it completes.

The file format is:

:: 

  ID      chrom   loc.start       loc.end num.mark        seg.mean
  09      chr1    367695  82438842        2279    0.546541374526925
  09      chr1    82778033        93082545        206     0.077841374526925
  09      chr1    93205647        103965955       188     -0.913458625473075
  09      chr1    104000621       104166584       4       -0.216558625473075
  09      chr1    104342470       110014374       109     -0.948958625473075
  09      chr1    110024223       110058480       4       -1.38295862547308


Methods
============
The pseudo-cgh algorithm employed by ngCGH takes as input two appropriately matched BAM files, typically from a tumor and a matched normal.  Genomic windows are defined by reading blocks of a fixed number of reads (default 1000 reads) in the normal sample.  Within each defined genomic window, the number of reads in the tumor is quantified.  For each genomic window, a ratio is made between the number of reads in the tumor and the number of reads in the normal.  Finally, a log2 transformation is applied to each ratio and the entire vector of the results is then centered by subtracting the median.

Owner

Name: Sean Davis
Login: seandavi
Kind: user
Location: Bethesda, MD, 20892, USA
Company: National Cancer Institute, National Institutes of Health

Website: https://seandavi.github.io/
Repositories: 87
Profile: https://github.com/seandavi

- Pediatric oncologist - Cancer researcher - Data scientist - Community organizer

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Committers

Last synced: over 2 years ago

All Time

Total Commits: 47
Total Committers: 4
Avg Commits per committer: 11.75
Development Distribution Score (DDS): 0.447

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Sean Davis	s**i@g**m	26
Sean Davis	s**2@m**v	17
Josh Waterfall	j**l@y**m	3
Wolfgang Resch	w**h@g**m	1

Committer Domains (Top 20 + Academic)

mail.nih.gov: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 6
Total pull requests: 4
Average time to close issues: 22 minutes
Average time to close pull requests: about 1 hour
Total issue authors: 2
Total pull request authors: 3
Average comments per issue: 0.17
Average comments per pull request: 1.25
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

seandavi (5)
oalder (1)

Pull Request Authors

joshwaterfall (2)
biocyberman (1)
wresch (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 13 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 15
Total maintainers: 1

pypi.org: ngcgh

Pseudo-cgh of next-generation sequencing data

Homepage: http://github.com/seandavi/ngCGH
Documentation: https://ngcgh.readthedocs.io/
License: GPL-2
Latest release: 0.4.4
published about 12 years ago

Versions: 15
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 13 Last month

Rankings

Dependent packages count: 10.0%

Forks count: 10.9%

Stargazers count: 14.2%

Average: 18.2%

Dependent repos count: 21.7%

Downloads: 34.1%

Maintainers (1)

Sean.Davis

Last synced: 6 months ago

Dependencies

setup.py pypi

pysam >=0.3.0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

ngcgh

Science Score: 33.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.rst

Owner

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

pypi.org: ngcgh

Rankings

Maintainers (1)

Dependencies