vcf2diploid

personal genome constructor

https://github.com/abyzovlab/vcf2diploid

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: ncbi.nlm.nih.gov
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (9.1%) to scientific vocabulary

Last synced: 10 months ago · JSON representation ·

Repository

personal genome constructor

Basic Info

Host: GitHub
Owner: abyzovlab
License: other
Language: Java
Default Branch: master
Homepage:
Size: 21.5 KB

Statistics

Stars: 9
Watchers: 3
Forks: 12
Open Issues: 6
Releases: 0

Created almost 12 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

README

A more generalized version of the tools has been recently developed. Please check it out at https://github.com/doron-st/vcf2diploid

README file for vcf2diploid tool distribution

1. Compilation
==============

$ make


2. Running
==========

java -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...]

where sample_id is the ID of individual whose genome is being constructed
(e.g., NA12878), file.fa is FASTA file(s) with reference sequence(s), and
file.vcf is VCF4.0 file(s) with variants. One can specify multiple FASTA and
VCF files at a time. Splitting the whole genome in multiple files (e.g., with
one FASTA file per chromosome) reduces memory usage.
Amount of memory used by Java can be increased as follows

java -Xmx4000m -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...]

You can try the program by running 'test_run.sh' script in the 'example'
directory. See also "Important notes" below.


3. Constructing personal annotation and splice-junction library
===============================================================

* Using chain file one can lift over annotation of reference genome to personal
haplotpes. This can be done with the liftOver tools
(see http://hgdownload.cse.ucsc.edu/admin/exe).

For example, to lift over Gencode annotation once can do

$ liftOver -gff ref_annotation.gtf mat.chain mat_annotation.gtf not_lifted.txt


* To construct personal splice-junction library(s) for RNAseq analysis one can
use RSEQtools (http://archive.gersteinlab.org/proj/rnaseq/rseqtools).


Important notes
===============

All characters between '>' and first white space in FASTA header are used
internally as chromosome/sequence names. For instance, for the header

>chr1 human

vcf2diploid will upload the corresponding sequence into the memory under the
name 'chr1'.
Chromosome/sequence names should be consistent between FASTA and VCF files but
omission of 'chr' at the beginning is allows, i.e. 'chr1' and '1' are treated as
the same name.

The output contains (file formats are described below):
1) FASTA files with sequences for each haplotype.
2) CHAIN files relating paternal/maternal haplotype to the reference genome.
3) MAP files with base correspondence between paternal-maternal-reference
sequences.

File formats:
* FASTA -- see http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
* CHAIN -- http://genome.ucsc.edu/goldenPath/help/chain.html
* MAP file represents block with equivalent bases in all three haplotypes
(paternal, maternal and reference) by one record with indices of the first
bases in each haplotype. Non-equivalent bases are represented as separate
records with '0' for haplotypes having non-equivalent base (see
clarification below).

Pat Mat Ref           MAP format
X   X   X    ____
X   X   X        \
X   X   X         --> P1 M1 R1
X   X   -    -------> P4 M4  0
X   X   -       ,--->  0 M6 R4
-   X   X    --'  ,-> P6 M7 R5
X   X   X    ----' 
X   X   X     
X   X   X     




For question and comments contact Alexej Abyzov at abyzov.alexej@mayo.edu

Owner

Name: Abyzov lab
Login: abyzovlab
Kind: organization
Location: Rochester, MN

Website: http://abyzovlab.org
Repositories: 11
Profile: https://github.com/abyzovlab

Software packages developed in Abyzov's lab

Citation (CITATION)

Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M.

AlleleSeq: analysis of allele-specific expression and binding in a network framework.
Source

Mol Syst Biol. 2011 Aug 2;7:522. doi: 10.1038/msb.2011.54.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science