https://github.com/adamtaranto/flo

Transfer annotations between genome assemblies using chained whole-genome alignment and UCSC LiftOver. Forked from wurmlab/flo.

Repository

Basic Info
  • Host: GitHub
  • Owner: Adamtaranto
  • Language: Ruby
  • Default Branch: master
  • Homepage:
  • Size: 4.43 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of wurmlab/flo
Created almost 9 years ago · Last pushed over 8 years ago

# flo - basic gff annotations lift over using chain files

Lift over is a way of mapping annotations from one genome assembly to another.
The idea is the same as that behind tools such as UCSC LiftOver and NCBI's LiftUp
web service. However, the NCBI and UCSC web services are available only for
a limited number of species.

To perform lift over locally, one can use UCSC chain files ([Kent et al 2003][kent2003])
with programs such as UCSC's liftOver or [CrossMap][crossmap]. A chain file captures
large, homologous segments between two genomes as chains of gapless blocks of
alignment. One way of generating chain files is using [this bash script][kent-script]
and [UCSC tools][ucsc-tools].
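
To make "chains of gapless blocks" concrete, here is a minimal Ruby sketch that reads a single record in the UCSC chain format and lists its blocks. The `parse_chain` helper and the example record are made up for illustration and are not part of flo; "target" and "query" are the chain format's own terms.

```ruby
# Illustrative only: parse_chain is a hypothetical helper, not part of flo.
# A UCSC chain record is a header line followed by alignment lines:
#   chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
#   size dt dq   (gapless block of `size` bases, then gaps of dt / dq bases)
#   size         (the last block has no trailing gaps)
ChainBlock = Struct.new(:t_start, :q_start, :size)

def parse_chain(text)
  lines = text.lines.map(&:strip).reject(&:empty?)
  header = lines.shift.split
  raise ArgumentError, 'not a chain record' unless header.first == 'chain'

  t_pos = Integer(header[5])  # tStart
  q_pos = Integer(header[10]) # qStart
  blocks = lines.map do |line|
    size, dt, dq = line.split.map { |n| Integer(n) }
    block = ChainBlock.new(t_pos, q_pos, size)
    # Advance past this block and the gap that follows it (if any).
    t_pos += size + (dt || 0)
    q_pos += size + (dq || 0)
    block
  end
  { score: Integer(header[1]), target: header[2], query: header[7], blocks: blocks }
end

# Made-up record: two gapless blocks separated by a 10 bp gap in the target only.
example = <<~CHAIN
  chain 1000 scaffold_001 2530722 + 100 310 contig_7 500000 + 2000 2200 1
  100 10 0
  100
CHAIN

chain = parse_chain(example)
chain[:blocks].each do |b|
  puts "target #{b.t_start} -> query #{b.q_start} (#{b.size} bp)"
end
```

The sketch only handles one record on the plus strand; real chain files contain many records and may place the query on the minus strand.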

flo is an implementation of the above script in the Ruby programming language. In
addition, both liftOver and CrossMap process GFF files line by line rather than
treating each transcript as a whole, which can produce output that is not
biologically meaningful. flo applies basic filtering to UCSC liftOver's GFF output.
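
As an illustration of why line-by-line lift over can be a problem, the exons of one transcript may end up on different sequences or strands. The sketch below flags such transcripts. It is not flo's actual filter: `inconsistent_parents` is a hypothetical helper, and it assumes tab-separated GFF3 rows whose children carry a `Parent` attribute.

```ruby
# Illustrative only: this is not flo's filtering code.
# inconsistent_parents is a hypothetical helper that groups child features by
# their Parent attribute and reports parents whose children ended up on more
# than one sequence or strand after lift over.
def inconsistent_parents(gff_text)
  placements = Hash.new { |hash, key| hash[key] = [] }
  gff_text.each_line do |line|
    next if line.start_with?('#') || line.strip.empty?

    cols = line.chomp.split("\t")
    next unless cols.size == 9

    parent = cols[8][/Parent=([^;]+)/, 1]
    next unless parent

    placements[parent] << [cols[0], cols[6]] # seqid, strand
  end
  placements.select { |_, seen| seen.uniq.size > 1 }.keys
end

# Made-up lifted features: RNA_1 is split across two scaffolds, RNA_2 is not.
rows = [
  %w[scaffold_A . exon 100 200 . + . ID=e1;Parent=RNA_1;],
  %w[scaffold_B . exon 300 400 . + . ID=e2;Parent=RNA_1;],
  %w[scaffold_A . exon 500 600 . - . ID=e3;Parent=RNA_2;],
  %w[scaffold_A . exon 700 800 . - . ID=e4;Parent=RNA_2;]
]
lifted = rows.map { |row| row.join("\t") }.join("\n")
p inconsistent_parents(lifted)
```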

[Using flo](#using-flo) | [Results & discussion](#results--discussion) | [Tweaking flo](#tweak-flo)

## Using flo

To use flo you need Ruby 2.0 or higher and the BioRuby gem.
On macOS, Homebrew and rbenv can be used to install a recent version of Ruby
and set it as the default:

    brew update
    brew install ruby-build
    brew install rbenv
    rbenv install 2.4.2
    rbenv global 2.4.2

To install the BioRuby and Rake gems:

    sudo gem install bio
    sudo gem install rake

flo additionally requires a few programs from [UCSC tools][ucsc-tools], [GNU
Parallel][gnu-parallel] and [genometools][genometools]. Clone flo from this 
repository and run the mac or unix install script to get dependencies:

    git clone https://github.com/Adamtaranto/flo.git && cd flo/scripts && ./install_mac.sh && cd ..

On macOS, use Homebrew to install genometools and GNU Parallel:

    brew install gt
    brew install parallel 

It's best to run flo in a new directory - we will call it flo_project:

    mkdir flo_project

Copy the example configuration file from the directory where you installed
flo to the project directory:

    cp flo_opts.yaml flo_project/project_opts.yaml

Now edit `flo_project/project_opts.yaml` to indicate:

1. Path to the project directory.
2. Locations of installed dependencies.
3. Location of source and target assembly in FASTA format (required).
4. Number of CPU cores to use (required - not auto detected). Default '1'.
5. BLAT parameters (optional). By default the target assembly is
   assumed to be of the same species. If the target assembly is
   a different (but closely related) species, you may want to
   lower `minIdentity`.
6. Location of GFF3 file(s) containing annotations on the source
   assembly. If this is omitted, flo will stop after generating
   the chain file.
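
Because some settings are required and are not auto detected, it can help to check the edited file before starting a long run. The following is a minimal sketch of such a check. The key names (`source_fasta`, `target_fasta`, `cpu_cores`) are hypothetical placeholders; substitute the names that actually appear in the `flo_opts.yaml` shipped with flo.

```ruby
require 'yaml'

# Illustrative only: these key names are hypothetical placeholders.
# Use the names that actually appear in the flo_opts.yaml shipped with flo.
REQUIRED_KEYS = %w[source_fasta target_fasta cpu_cores].freeze

# Returns the required keys that are absent or blank in the parsed options.
def missing_settings(opts, required = REQUIRED_KEYS)
  required.select { |key| opts[key].to_s.strip.empty? }
end

# Made-up configuration in which the core count was left blank.
example_yaml = <<~YAML
  source_fasta: old_assembly.fa
  target_fasta: new_assembly.fa
  cpu_cores: ''
YAML

opts = YAML.load(example_yaml)
puts "Missing settings: #{missing_settings(opts).join(', ')}"
```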

Note that flo can only work with transcripts and their child exons and CDS.
Transcripts can be annotated as mRNA, transcript, or gene. However, if you
have a separate 'gene' annotation for each transcript, you will need to
remove the gene features and provide a cleaned GFF file:

    ./gff_remove_feats.rb gene xx_genes.gff > xx_genes_cleaned.gff

*Note:* GFF files must contain appropriate headers, and every record
must have an *ID* attribute formatted as `ID=someidstring;`, with
every row ending in a `;`:

    ##gff-version   3
    ##sequence-region   scaffold_001 1 2530722
    scaffold_001  . gene  26031 29368 . - . ID=Gene_00004;
    scaffold_001  . mRNA  26031 29368 . - . ID=RNA_00004;Parent=Gene_00004;
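
The requirements in the note above can be checked before running flo. The sketch below is one possible way to do so; `gff_format_problems` is a hypothetical helper, not part of flo, and it assumes tab-separated GFF3 columns.

```ruby
# Illustrative only: gff_format_problems is a hypothetical helper, not part of flo.
# It checks for a ##gff-version header, an ID attribute on every record,
# and a trailing ';' on every record.
def gff_format_problems(gff_text)
  problems = []
  lines = gff_text.lines.map(&:chomp)
  unless lines.first.to_s.start_with?('##gff-version')
    problems << 'missing ##gff-version header'
  end
  lines.each_with_index do |line, index|
    next if line.start_with?('#') || line.strip.empty?

    attributes = line.split("\t")[8].to_s # ninth column holds the attributes
    unless attributes =~ /(\A|;)ID=[^;]+/
      problems << "line #{index + 1}: no ID attribute"
    end
    unless attributes.end_with?(';')
      problems << "line #{index + 1}: attributes do not end in ';'"
    end
  end
  problems
end

# Made-up records: the first follows the rules, the second breaks all three.
good_row = %w[scaffold_001 . mRNA 26031 29368 . - . ID=RNA_00004;Parent=Gene_00004;]
bad_row  = %w[scaffold_001 . mRNA 26031 29368 . - . Parent=Gene_00004]
good = "##gff-version 3\n" + good_row.join("\t")
bad  = bad_row.join("\t")

puts gff_format_problems(good).empty? ? 'good: ok' : 'good: has problems'
puts gff_format_problems(bad)
```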

Ensure proper sorting and formatting by pre-processing your annotation file 
with genometools: 

    gt gff3 -sortlines yes -retainids yes -tidy yes -fixregionboundaries yes -addids xx_genes_cleaned.gff > xx_genes_cleaned_sorted.gff

Finally, run flo as:

    rake -f runflo.rake FLO_OPTS=flo_project/project_opts.yaml

This will generate a directory called 'run_[timestamp]/' within the project directory.
A subdirectory is created within 'run_[timestamp]/' for each GFF3 input file. It contains:

1. `lifted.gff3` and `unlifted.gff3` - liftOver's output.
2. `lifted_cleaned.gff` - `lifted.gff3` cleaned by flo. This is the final output.
3. `unmapped.txt` - IDs of all transcripts that couldn't be lifted.
   Transcripts present in this list and also found in the output GFF
   should be considered partial.
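
One way to list the partial transcripts is to intersect the IDs in `unmapped.txt` with the IDs in the output GFF. The sketch below assumes that `unmapped.txt` holds one transcript ID per line, which should be verified against a real run; `partial_transcripts` is a hypothetical helper, not part of flo.

```ruby
require 'set'

# Illustrative only: partial_transcripts is a hypothetical helper, not part of flo.
# unmapped_text: contents of unmapped.txt (assumed: one transcript ID per line).
# gff_text:      contents of lifted_cleaned.gff.
def partial_transcripts(unmapped_text, gff_text)
  unmapped = unmapped_text.lines.map(&:strip).reject(&:empty?).to_set
  lifted = gff_text.each_line
                   .map { |line| line[/\bID=([^;\s]+)/, 1] }
                   .compact
                   .to_set
  (unmapped & lifted).to_a.sort
end

# Made-up data: RNA_00004 is listed as unmapped but still appears in the output.
unmapped = "RNA_00004\nRNA_00009\n"
gff = [
  %w[scaffold_001 . mRNA 26031 29368 . - . ID=RNA_00004;],
  %w[scaffold_001 . mRNA 40000 41000 . + . ID=RNA_00005;]
].map { |row| row.join("\t") }.join("\n")

p partial_transcripts(unmapped, gff)
```

With real output, the two arguments would be the contents of the files, for example `File.read('unmapped.txt')` and `File.read('lifted_cleaned.gff')` from the relevant run subdirectory.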

The chain file generated by flo can be found at 'run_[timestamp]/liftover.chn'.

## Results & discussion
Both the strengths and the weaknesses of flo largely reflect those of the
underlying tools - the chain file and UCSC liftOver. In general, gaps and
errors in assemblies may split a long chain. Gene models that are split across
different chains, as well as those that are duplicated in the target
assembly, are not lifted.

- For an ant genome (~350 Mb) we saw 90% of annotations map identically to
the new assembly (unpublished result).
- flo has been used in [First draft assembly and annotation of the
genome of a California Endemic Oak _Quercus lobata_ Ne (Fagaceae).
Sork et al 2016. G3: Genes, Genomes, Genetics 6(11): 3485-3495.](https://doi.org/10.1534/g3.116.030411)

## Tweak flo
If you would like to optimise how chain files are created:
- The UCSC wiki and website are excellent resources for learning about BLAT
  and chain files. Read the Kent 2003 paper cited above first.
- Read the `Rakefile` from top to bottom. Ruby is similar to Perl and bash,
  but simpler.

You can test changes by lifting annotations from an assembly onto itself.
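
For such a self-to-self test, the lifted coordinates should ideally match the input. A rough way to quantify this is sketched below. `feature_coordinates` and `identical_fraction` are hypothetical helpers, not part of flo, and they assume tab-separated GFF3 with a unique `ID` per feature.

```ruby
# Illustrative only: these helpers are hypothetical and not part of flo.
# Maps each feature ID to [seqid, start, end, strand].
def feature_coordinates(gff_text)
  gff_text.each_line.each_with_object({}) do |line, coords|
    cols = line.chomp.split("\t")
    next if line.start_with?('#') || cols.size != 9

    id = cols[8][/\bID=([^;]+)/, 1]
    coords[id] = [cols[0], cols[3].to_i, cols[4].to_i, cols[6]] if id
  end
end

# Fraction of source features found at identical coordinates in the lifted GFF.
def identical_fraction(source_gff, lifted_gff)
  source = feature_coordinates(source_gff)
  lifted = feature_coordinates(lifted_gff)
  return 0.0 if source.empty?

  same = source.count { |id, coords| lifted[id] == coords }
  same.to_f / source.size
end

# Made-up data: RNA_2 shifts by 5 bp after lift over, RNA_1 is unchanged.
to_gff = ->(rows) { rows.map { |row| row.join("\t") }.join("\n") }
source = to_gff.call([
  %w[scaffold_001 . mRNA 100 900 . + . ID=RNA_1;],
  %w[scaffold_001 . mRNA 2000 2600 . - . ID=RNA_2;]
])
lifted = to_gff.call([
  %w[scaffold_001 . mRNA 100 900 . + . ID=RNA_1;],
  %w[scaffold_001 . mRNA 2005 2605 . - . ID=RNA_2;]
])

puts format('%.0f%% of features kept identical coordinates',
            identical_fraction(source, lifted) * 100)
```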

---
Copyright 2017 Anurag Priyam, Queen Mary University of London

[kent-script]: http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/hg/doc/liftOver.txt
[kent2003]: http://www.pnas.org/content/100/20/11484.full
[ucsc-tools]: http://hgdownload.cse.ucsc.edu/admin/exe/
[gnu-parallel]: https://www.gnu.org/software/parallel/
[genometools]: http://genometools.org/
[crossmap]: http://crossmap.sourceforge.net/

Owner

  • Name: Adam Taranto
  • Login: Adamtaranto
  • Kind: user
  • Location: Melbourne, Australia
  • Company: The University of Melbourne
