vcf2diploid

personal genome constructor

https://github.com/abyzovlab/vcf2diploid

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.1%) to scientific vocabulary
Last synced: 6 months ago · JSON representation ·

Repository

personal genome constructor

Basic Info
  • Host: GitHub
  • Owner: abyzovlab
  • License: other
  • Language: Java
  • Default Branch: master
  • Homepage:
  • Size: 21.5 KB
Statistics
  • Stars: 9
  • Watchers: 3
  • Forks: 12
  • Open Issues: 6
  • Releases: 0
Created over 11 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README

A more generalized version of the tools has been recently developed. Please check it out at https://github.com/doron-st/vcf2diploid

README file for vcf2diploid tool distribution

1. Compilation
==============

$ make


2. Running
==========

java -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...]

where sample_id is the ID of individual whose genome is being constructed
(e.g., NA12878), file.fa is FASTA file(s) with reference sequence(s), and
file.vcf is VCF4.0 file(s) with variants. One can specify multiple FASTA and
VCF files at a time. Splitting the whole genome in multiple files (e.g., with
one FASTA file per chromosome) reduces memory usage.
Amount of memory used by Java can be increased as follows

java -Xmx4000m -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...]

You can try the program by running 'test_run.sh' script in the 'example'
directory. See also "Important notes" below.


3. Constructing personal annotation and splice-junction library
===============================================================

* Using chain file one can lift over annotation of reference genome to personal
haplotpes. This can be done with the liftOver tools
(see http://hgdownload.cse.ucsc.edu/admin/exe).

For example, to lift over Gencode annotation once can do

$ liftOver -gff ref_annotation.gtf mat.chain mat_annotation.gtf not_lifted.txt


* To construct personal splice-junction library(s) for RNAseq analysis one can
use RSEQtools (http://archive.gersteinlab.org/proj/rnaseq/rseqtools).


Important notes
===============

All characters between '>' and first white space in FASTA header are used
internally as chromosome/sequence names. For instance, for the header

>chr1 human

vcf2diploid will upload the corresponding sequence into the memory under the
name 'chr1'.
Chromosome/sequence names should be consistent between FASTA and VCF files but
omission of 'chr' at the beginning is allows, i.e. 'chr1' and '1' are treated as
the same name.

The output contains (file formats are described below):
1) FASTA files with sequences for each haplotype.
2) CHAIN files relating paternal/maternal haplotype to the reference genome.
3) MAP files with base correspondence between paternal-maternal-reference
sequences.

File formats:
* FASTA -- see http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
* CHAIN -- http://genome.ucsc.edu/goldenPath/help/chain.html
* MAP file represents block with equivalent bases in all three haplotypes
(paternal, maternal and reference) by one record with indices of the first
bases in each haplotype. Non-equivalent bases are represented as separate
records with '0' for haplotypes having non-equivalent base (see
clarification below).

Pat Mat Ref           MAP format
X   X   X    ____
X   X   X        \
X   X   X         --> P1 M1 R1
X   X   -    -------> P4 M4  0
X   X   -       ,--->  0 M6 R4
-   X   X    --'  ,-> P6 M7 R5
X   X   X    ----' 
X   X   X     
X   X   X     




For question and comments contact Alexej Abyzov at abyzov.alexej@mayo.edu

Owner

  • Name: Abyzov lab
  • Login: abyzovlab
  • Kind: organization
  • Location: Rochester, MN

Software packages developed in Abyzov's lab

Citation (CITATION)

Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M.

AlleleSeq: analysis of allele-specific expression and binding in a network framework.
Source

Mol Syst Biol. 2011 Aug 2;7:522. doi: 10.1038/msb.2011.54.

GitHub Events

Total
  • Push event: 1
  • Fork event: 1
Last Year
  • Push event: 1
  • Fork event: 1