https://github.com/adamtaranto/flo
Transfer annotations between genome assemblies using chained whole-genome alignment and UCSC LiftOver. Forked from wurmlab/flo.
# flo - basic gff annotations lift over using chain files
Lift over is a way of mapping annotations from one genome assembly to another.
The idea is the same as what tools like UCSC LiftOver and NCBI's LiftUp
web service do. However, NCBI's and UCSC's web services are available only for
a limited number of species.
To perform lift over locally, one can use UCSC chain files ([Kent et al 2003][kent2003])
with programs such as UCSC's liftOver or [CrossMap][crossmap]. A chain file captures
large, homologous segments between two genomes as chains of gapless blocks of
alignment. One way of generating chain files is using [this bash script][kent-script]
and [UCSC tools][ucsc-tools].
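
To make the structure of a chain file concrete, here is a minimal Ruby sketch that reads a single chain and expands it into its gapless blocks. It assumes the standard UCSC chain layout: a `chain` header line, followed by `size dt dq` lines, with a final line that holds only `size`. The names `parse_chain` and `Block` are hypothetical and are not part of flo.

```ruby
# One gapless block of alignment. Coordinates are 0-based and half-open,
# as in the UCSC chain format. When the query strand is "-", query
# coordinates refer to the reverse-complemented query sequence.
Block = Struct.new(:t_start, :t_end, :q_start, :q_end)

# Parse the text of a single chain (header plus alignment lines).
def parse_chain(text)
  lines = text.lines.map(&:strip).reject(&:empty?)
  header = lines.shift.to_s.split
  raise ArgumentError, 'not a chain header' unless header[0] == 'chain'

  t_pos = Integer(header[5])  # tStart
  q_pos = Integer(header[10]) # qStart
  blocks = lines.map do |line|
    size, dt, dq = line.split.map { |field| Integer(field) }
    block = Block.new(t_pos, t_pos + size, q_pos, q_pos + size)
    # dt and dq (gap sizes in target and query) are absent on the last line.
    t_pos += size + (dt || 0)
    q_pos += size + (dq || 0)
    block
  end
  { score: Integer(header[1]), target: header[2], query: header[7], blocks: blocks }
end

# Made-up example chain: two gapless blocks separated by a gap of
# 10 bases in the target and 20 bases in the query.
example = <<CHAIN
chain 1000 scaffold_001 5000 + 100 400 scaffold_A 6000 + 150 460 1
100 10 20
190
CHAIN

parse_chain(example)[:blocks].each do |b|
  puts "target #{b.t_start}-#{b.t_end} -> query #{b.q_start}-#{b.q_end}"
end
```

A real chain file holds many such chains separated by blank lines, so a full reader would split the file on blank lines before calling a function like this.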
flo is an implementation of the above script in the Ruby programming language. Further,
both liftOver and CrossMap process GFF files line by line instead of treating each
transcript as a whole. This results in some biologically meaningless output. flo provides
basic filtering of UCSC liftOver's GFF output.
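
To illustrate why line-by-line lifting needs a transcript-level check, the sketch below groups lifted GFF lines by their `Parent` attribute and keeps only transcripts whose children all landed on the same sequence and strand, within the transcript's coordinates. This is an illustrative consistency rule under the assumption of tab-separated GFF3 input; it is not flo's actual filter, and `parse_gff_line` and `consistent_transcripts` are hypothetical names.

```ruby
# Split one GFF3 record into the fields needed for the check.
# Assumes tab-separated columns and "key=value;" attributes.
def parse_gff_line(line)
  cols = line.chomp.split("\t")
  attributes = {}
  cols[8].to_s.split(';').each do |pair|
    key, value = pair.strip.split('=', 2)
    attributes[key] = value if key && value
  end
  { seqid: cols[0], type: cols[2], start: Integer(cols[3]), stop: Integer(cols[4]),
    strand: cols[6], id: attributes['ID'], parent: attributes['Parent'] }
end

# Return the IDs of transcripts whose child features are all on the
# transcript's sequence and strand and lie inside its coordinates.
def consistent_transcripts(gff_lines)
  features = gff_lines.reject { |l| l.start_with?('#') || l.strip.empty? }
                      .map { |l| parse_gff_line(l) }
  children = features.select { |f| f[:parent] }.group_by { |f| f[:parent] }
  transcripts = features.select { |f| %w[mRNA transcript].include?(f[:type]) }
  kept = transcripts.select do |tr|
    (children[tr[:id]] || []).all? do |child|
      child[:seqid] == tr[:seqid] && child[:strand] == tr[:strand] &&
        child[:start] >= tr[:start] && child[:stop] <= tr[:stop]
    end
  end
  kept.map { |tr| tr[:id] }
end
```

A transcript whose exons were lifted to two different scaffolds would be dropped by this rule, even though each exon line on its own looks valid.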
[Using flo](#using-flo) | [Results & discussion](#results--discussion) | [Tweaking flo](#tweak-flo)
## Using flo
To use flo you must have Ruby 2.0 or higher and the BioRuby gem.
On macOS, Homebrew can be used to install Ruby and set it as the default:

    brew update
    brew install ruby-build
    brew install rbenv
    rbenv install 2.4.2
    rbenv global 2.4.2

To install the BioRuby and rake gems:

    sudo gem install bio
    sudo gem install rake

flo additionally requires a few programs from [UCSC tools][ucsc-tools], [GNU
Parallel][gnu-parallel] and [genometools][genometools]. Clone flo from this
repository and run the Mac or Unix install script to get the dependencies:

    git clone https://github.com/Adamtaranto/flo.git && cd flo/scripts && ./install_mac.sh && cd ..

On macOS, use Homebrew to install genometools and GNU Parallel:

    brew install gt
    brew install parallel

It's best to run flo in a new directory - we will call it flo_project:

    mkdir flo_project

Copy the example configuration file from where you installed flo to the
project directory:

    cp flo_opts.yaml flo_project/project_opts.yaml

Now edit `flo_project/project_opts.yaml` to indicate:

1. Path to the project directory.
2. Locations of installed dependencies.
3. Locations of the source and target assemblies in FASTA format (required).
4. Number of CPU cores to use (required; not auto-detected). Default is '1'.
5. BLAT parameters (optional). By default the target assembly is
   assumed to be of the same species. If the target assembly is
   a different (but closely related) species, you may want to
   lower `minIdentity`.
6. Location of GFF3 file(s) containing annotations on the source
   assembly. If this is omitted, flo will stop after generating
   the chain file.

Note that flo can only work with transcripts and their child exons and
CDS. Transcripts can be annotated as mRNA, transcript, or gene. However,
if you have a 'gene' annotation for each transcript, you will need to
remove the gene features and provide a cleaned GFF file:

    ./gff_remove_feats.rb gene xx_genes.gff > xx_genes_cleaned.gff

*Note:* GFF files must contain appropriate headers, and every record
must have an *ID* attribute formatted as `ID=someidstring;`, with
every row ending in a `;`:

    ##gff-version 3
    ##sequence-region scaffold_001 1 2530722
    scaffold_001 . gene 26031 29368 . - . ID=Gene_00004;
    scaffold_001 . mRNA 26031 29368 . - . ID=RNA_00004;Parent=Gene_00004;

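
The formatting requirements above can be checked before running flo. The following is a minimal sketch, assuming the GFF3 columns are tab-separated as the GFF3 format specifies; `gff_format_problems` is a hypothetical helper, not a script shipped with flo.

```ruby
# Return [line_number, message] pairs for lines that break the
# conventions described above: a "##gff-version 3" header, an ID
# attribute on every record, and attributes ending in ";".
def gff_format_problems(lines)
  problems = []
  unless lines.first.to_s.start_with?('##gff-version 3')
    problems << [1, 'missing ##gff-version 3 header']
  end
  lines.each_with_index do |line, index|
    next if line.start_with?('#') || line.strip.empty?
    attrs = line.chomp.split("\t")[8].to_s
    problems << [index + 1, 'no ID attribute'] unless attrs =~ /(\A|;)ID=[^;]+/
    problems << [index + 1, 'attributes do not end in ";"'] unless attrs.end_with?(';')
  end
  problems
end

# Example use on a file (path is illustrative):
#   gff_format_problems(File.readlines('xx_genes_cleaned.gff')).each do |n, msg|
#     warn "line #{n}: #{msg}"
#   end
```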
Ensure proper sorting and formatting by pre-processing your annotation file
with genometools:

    gt gff3 -sortlines yes -retainids yes -tidy yes -fixregionboundaries yes -addids xx_genes_cleaned.gff > xx_genes_cleaned_sorted.gff

Finally, run flo as:

    rake -f runflo.rake FLO_OPTS=flo_project/project_opts.yaml

This will generate a directory called 'run_[timestamp]/' within the project directory.
A subdirectory is created within 'run_[timestamp]/' for each GFF3 input file. It contains:

1. `lifted.gff3` and `unlifted.gff3` - liftOver's output.
2. `lifted_cleaned.gff` - `lifted.gff3` cleaned by flo; this is the final output.
3. `unmapped.txt` - IDs of all transcripts that couldn't be lifted.
   Transcripts present in this list and also found in the output GFF
   should be considered partial.

The chain file generated by flo can be found at 'run_[timestamp]/liftover.chn'.
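
The rule for spotting partial transcripts can be applied with a few lines of Ruby. This is a sketch under two assumptions that the text above does not confirm: that `unmapped.txt` holds one transcript ID per line, and that the output GFF is tab-separated with `ID=` attributes. `partial_transcripts` is a hypothetical name.

```ruby
require 'set'

# Return IDs that appear both in the unmapped list and in the lifted GFF.
def partial_transcripts(unmapped_ids, lifted_gff_lines)
  unmapped = Set.new(unmapped_ids.map(&:strip).reject(&:empty?))
  records = lifted_gff_lines.reject { |l| l.start_with?('#') || l.strip.empty? }
  lifted_ids = records.map { |l| l.chomp.split("\t")[8].to_s[/(?:\A|;)ID=([^;]+)/, 1] }
  lifted_ids.compact.select { |id| unmapped.include?(id) }.uniq
end

# Example use (paths are illustrative):
#   ids = partial_transcripts(File.readlines('unmapped.txt'),
#                             File.readlines('lifted_cleaned.gff'))
#   puts ids
```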
## Results & discussion
Both the strengths and the weaknesses of flo largely reflect those of the
underlying tools - the chain file and UCSC liftOver. In general, gaps and
errors in assemblies may split a long chain. Gene models that are split across
different chains, as well as those that are duplicated in the target
assembly, are not lifted.

- For an ant genome (~350 Mb) we saw 90% of annotations map identically to
  the new assembly (unpublished result).
- flo has been used in [First draft assembly and annotation of the
  genome of a California Endemic Oak _Quercus lobata_ Née (Fagaceae).
  Sork et al 2016. G3: Genes, Genomes, Genetics 6(11): 3485-3495.](https://doi.org/10.1534/g3.116.030411)

## Tweak flo
If you would like to optimise how chain files are created:

- The UCSC wiki and website are an amazing resource for learning about BLAT and
  chain files. Don't forget to read the Kent et al 2003 paper cited above first.
- Read the `Rakefile` from top to bottom. Ruby is similar to, yet simpler
  than, Perl and bash.

You can test things by lifting annotations from an assembly onto itself.
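
For such a self-lift test, one simple measure is the fraction of source features that come back with unchanged coordinates. The sketch below compares sequence name, start, end and strand per feature ID. It assumes tab-separated GFF3 with `ID=` attributes, ignores features that have no ID, and uses the hypothetical names `coordinates_by_id` and `identical_fraction`.

```ruby
# Map each feature ID to [seqid, start, end, strand].
def coordinates_by_id(gff_lines)
  coords = {}
  gff_lines.each do |line|
    next if line.start_with?('#') || line.strip.empty?
    cols = line.chomp.split("\t")
    id = cols[8].to_s[/(?:\A|;)ID=([^;]+)/, 1]
    coords[id] = [cols[0], cols[3], cols[4], cols[6]] if id
  end
  coords
end

# Fraction of source features found in the lifted output with
# identical coordinates. Missing features count as not identical.
def identical_fraction(source_lines, lifted_lines)
  source = coordinates_by_id(source_lines)
  lifted = coordinates_by_id(lifted_lines)
  return 0.0 if source.empty?
  same = source.count { |id, coords| lifted[id] == coords }
  same.to_f / source.size
end

# Example use (paths are illustrative):
#   puts identical_fraction(File.readlines('xx_genes_cleaned_sorted.gff'),
#                           File.readlines('lifted_cleaned.gff'))
```

When lifting an assembly onto itself, a value well below 1.0 suggests that the BLAT parameters or the chaining steps are worth revisiting.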
---
Copyright 2017 Anurag Priyam, Queen Mary University of London
[kent-script]: http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/hg/doc/liftOver.txt
[kent2003]: http://www.pnas.org/content/100/20/11484.full
[ucsc-tools]: http://hgdownload.cse.ucsc.edu/admin/exe/
[gnu-parallel]: https://www.gnu.org/software/parallel/
[genometools]: http://genometools.org/
[crossmap]: http://crossmap.sourceforge.net/
Owner
- Name: Adam Taranto
- Login: Adamtaranto
- Kind: user
- Location: Melbourne, Australia
- Company: The University of Melbourne
- Repositories: 38
- Profile: https://github.com/Adamtaranto