flo

Same species annotation lift over pipeline.

https://github.com/wurmlab/flo

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 10 DOI reference(s) in README
✓ Academic publication links: Links to biorxiv.org, wiley.com, nature.com, mdpi.com
✓ Committers with academic emails: 2 of 5 committers (40.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.1%) to scientific vocabulary

Keywords

bioinformatics gene-prediction gff liftover

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: wurmlab
Language: Ruby
Default Branch: master
Homepage:
Size: 4.45 MB

Statistics

Stars: 98
Watchers: 15
Forks: 28
Open Issues: 11
Releases: 0

Topics

bioinformatics gene-prediction gff liftover

Created almost 11 years ago · Last pushed over 2 years ago

Metadata Files

Readme Citation

flo - basic gff annotations lift over using chain files

Lift over is a way of mapping annotations from one genome assembly to another. The idea of "lift over" is the same as what tools such as UCSC LiftOver and NCBI's LiftUp web service do. However, the NCBI and UCSC web services are available only for a limited number of species.

To perform lift over locally, one can use UCSC chain files (Kent et al 2003) with programs such as UCSC's liftOver or CrossMap. A chain file captures large, homologous segments between two genomes as chains of gapless blocks of alignment. One way of generating chain files is using this bash script and UCSC tools.
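To make the "chains of gapless blocks" idea concrete, here is a minimal Ruby sketch that reads a single chain in the UCSC chain format: a header line followed by `size dt dq` lines, the last of which carries only `size`. It is an illustration, not code from flo. The helper names (`parse_chain`, `aligned_bases`) and the sample chain are made up, and the sketch assumes Ruby 2.4 or newer.

```ruby
# Field names follow the UCSC chain header: t = reference, q = query.
CHAIN_FIELDS = %i[score t_name t_size t_strand t_start t_end
                  q_name q_size q_strand q_start q_end id].freeze
INTEGER_FIELDS = %i[score t_size t_start t_end q_size q_start q_end].freeze

# Parse the text of one chain (header plus block lines) into a Hash.
def parse_chain(text)
  lines = text.lines.map(&:strip).reject(&:empty?)
  header = lines.shift.split
  raise ArgumentError, 'not a chain header' unless header.shift == 'chain'

  chain = CHAIN_FIELDS.zip(header).to_h
  INTEGER_FIELDS.each { |field| chain[field] = Integer(chain[field]) }

  # Each block line is "size dt dq": a gapless aligned block, then the gap
  # that follows it on the reference (dt) and on the query (dq).
  # The final block has no gap after it, so dt and dq default to 0.
  chain[:blocks] = lines.map do |line|
    size, dt, dq = line.split.map { |value| Integer(value) }
    { size: size, dt: dt || 0, dq: dq || 0 }
  end
  chain
end

# Total number of bases covered by gapless blocks.
def aligned_bases(chain)
  chain[:blocks].sum { |block| block[:size] }
end

# Made-up chain for illustration only.
SAMPLE_CHAIN = <<~CHAIN
  chain 4900 scaffold1 1000 + 100 355 contigA 900 + 50 310 1
  100 5 0
  50 0 10
  100
CHAIN

chain = parse_chain(SAMPLE_CHAIN)
puts "#{chain[:t_name]} -> #{chain[:q_name]}: " \
     "#{aligned_bases(chain)} aligned bases in #{chain[:blocks].size} blocks"
```

In a well-formed chain, the block sizes plus the `dt` gaps add up to `t_end - t_start`, and the block sizes plus the `dq` gaps add up to `q_end - q_start`.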

flo is an implementation of the above script in the Ruby programming language. Further, both liftOver and CrossMap process GFF files line by line instead of treating each transcript as a whole, which results in some output that is not biologically meaningful. flo provides basic filtering of UCSC liftOver's GFF output.

We created flo for our work on the fire ant genome. If you use flo, please cite the following paper:

The fire ant social chromosome supergene variant Sb shows low diversity but high divergence from SB. 2017. R Pracana, A Priyam, I Levantis, Y Wurm. Molecular Ecology, doi: 10.1111/mec.14054.

Using flo | Results & discussion | Tweaking flo

Using flo

To use flo you must have Ruby 2.0 or higher and the BioRuby gem. Ruby 2.0 can be installed through package managers on Linux and is available by default on Mac. To install the BioRuby gem:

sudo gem install bio

flo additionally requires a few programs from UCSC tools, GNU Parallel and genometools. These can be installed in any directory by running the 'scripts/install.sh' script once you have downloaded flo. To download flo:

wget -c https://github.com/yeban/flo/archive/master.tar.gz -O flo.tar.gz
tar xvf flo.tar.gz
mv flo-master flo

It's best to run flo in a new directory - we will call it project dir:

mkdir flo_species_name
cd flo_species_name

Copy the example configuration file from where you installed flo to the project dir:

cp /path/to/flo/opts_example.yaml flo_opts.yaml

Install flo's dependencies in the ext/ directory of the project dir:

/path/to/flo/scripts/install.sh

Now edit flo_opts.yaml to indicate:

1. Location of source and target assembly in FASTA format (required).
2. Location of GFF3 file(s) containing annotations on the source assembly. If this is omitted, flo will stop after generating the chain file.
3. BLAT parameters (optional). By default the target assembly is assumed to be of the same species. If the target assembly is a different (but closely related) species, you may want to lower minIdentity.
4. Number of CPU cores to use (required - not auto detected). This cannot be greater than the number of scaffolds in the target assembly.
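The limit on CPU cores in item 4 can be checked before starting a run. The following is a minimal sketch, not part of flo: `count_fasta_records` and `usable_processes` are hypothetical helper names, and the FASTA text is toy data.

```ruby
require 'stringio'

# Count sequences in a FASTA stream by counting header lines.
def count_fasta_records(io)
  io.each_line.count { |line| line.start_with?('>') }
end

# Largest core count that does not exceed the number of target sequences.
def usable_processes(requested, target_fasta_io)
  [requested, count_fasta_records(target_fasta_io)].min
end

# Toy target assembly with three scaffolds.
toy_target = StringIO.new(">s1\nACGT\n>s2\nGGCC\n>s3\nTTAA\n")
puts usable_processes(8, toy_target)
```

With a real assembly you would pass an open file instead of the `StringIO`, for example `File.open('target.fa') { |f| usable_processes(8, f) }`, where `target.fa` is a placeholder path.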

Here, it's important to note that flo can only work with transcripts and their child exons and CDS. Transcripts can be annotated as: mRNA, transcript, or gene. However, if you have a 'gene' annotation for each transcript, you will need to remove that:

/path/to/flo/gff_remove_feats.rb gene xx_genes.gff \
> xx_transcripts.gff

Alternatively, if you have more than one transcript annotated for each gene, you can select the longest transcript for each gene to work with:

/path/to/flo/gff_longest_transcripts.rb xx_genes.gff \
> xx_longest_transcripts.gff
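To illustrate what selecting one transcript per gene involves, here is a minimal Ruby sketch. It is not the code of gff_longest_transcripts.rb. It assumes that transcripts are `mRNA` features whose `Parent` attribute names the gene, and that "longest" means the largest genomic span; flo's script may define this differently. The helper names and the toy GFF are made up.

```ruby
# Parse GFF3 column 9 ("key=value;key=value") into a Hash.
def gff_attributes(col9)
  col9.strip.split(';')
      .select { |pair| pair.include?('=') }
      .map { |pair| pair.split('=', 2) }
      .to_h
end

# Map each gene id to the id of its longest mRNA (by genomic span).
def longest_transcript_ids(gff_text)
  best = {} # gene id => [span, transcript id]
  gff_text.each_line do |line|
    next if line.start_with?('#') || line.strip.empty?

    cols = line.chomp.split("\t")
    next unless cols.size >= 9 && cols[2] == 'mRNA'

    attrs = gff_attributes(cols[8])
    gene = attrs['Parent']
    next unless gene

    span = Integer(cols[4]) - Integer(cols[3]) + 1
    best[gene] = [span, attrs['ID']] if best[gene].nil? || span > best[gene][0]
  end
  best.map { |gene, (_span, id)| [gene, id] }.to_h
end

# Toy annotation: gene g1 has two transcripts, gene g2 has one.
TOY_GFF = <<~GFF
  ##gff-version 3
  chr1\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1
  chr1\ttoy\tmRNA\t100\t500\t.\t+\t.\tID=t1;Parent=g1
  chr1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=t2;Parent=g1
  chr1\ttoy\tgene\t2000\t2400\t.\t-\t.\tID=g2
  chr1\ttoy\tmRNA\t2000\t2400\t.\t-\t.\tID=t3;Parent=g2
GFF

p longest_transcript_ids(TOY_GFF)
```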

Finally, run flo as:

rake -f /path/to/flo/Rakefile

A common problem is that the first column of the GFF file doesn't match the chromosome, scaffold, or contig ids in the source assembly. In this case liftOver generates an empty output file and flo stops at this point. You can fix the GFF file and resume flo by rerunning the above command.
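This mismatch can be detected before running flo. The following is a minimal sketch, not part of flo, that lists the sequence ids used in a GFF file but absent from a FASTA file. It assumes that a FASTA id is the header text up to the first whitespace. The helper names and the toy data are made up.

```ruby
require 'set'
require 'stringio'

# Ids of all sequences in a FASTA stream (header text up to first whitespace).
def fasta_ids(io)
  ids = Set.new
  io.each_line do |line|
    next unless line.start_with?('>')

    id = line[1..-1].split.first
    ids << id if id
  end
  ids
end

# Distinct values of column 1 in a GFF stream, skipping comments and blanks.
def gff_seqids(io)
  ids = Set.new
  io.each_line do |line|
    next if line.start_with?('#') || line.strip.empty?

    ids << line.split("\t").first
  end
  ids
end

# GFF sequence ids that have no matching FASTA record.
def missing_seqids(gff_io, fasta_io)
  (gff_seqids(gff_io) - fasta_ids(fasta_io)).to_a.sort
end

toy_fasta = StringIO.new(">scaffold_1 length=4\nACGT\n>scaffold_2\nGG\n")
toy_gff = StringIO.new(
  "##gff-version 3\n" \
  "scaffold_1\ttoy\tmRNA\t1\t4\t.\t+\t.\tID=t1\n" \
  "Scaffold2\ttoy\tmRNA\t1\t2\t.\t+\t.\tID=t2\n"
)
p missing_seqids(toy_gff, toy_fasta)
```

An empty result means every id in the first GFF column has a matching sequence in the assembly.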

flo writes all output to a directory called run/ in the current directory. The chain file generated by flo can be found at run/liftover.chn. If flo completed successfully, a directory is created in run/ for each given GFF3 file. It contains:

1. lifted.gff3 and unlifted.gff3 - liftOver's output
2. lifted_cleaned.gff - lifted.gff3 cleaned by flo -> final output
3. unmapped.txt - ids of all transcripts that were not lifted, and of transcripts whose coding sequence is not identical before and after lift over

Non-identical coding sequences can result from SNPs and short indels between the samples used to construct the source and target assemblies, from sequencing errors in the target assembly or annotation errors in the source assembly, or from the transcript mapping to a duplicated region. Transcripts with non-identical coding sequences are included in the final GFF, but their ids are also listed in unmapped.txt to signal lower confidence, because it is difficult to separate true polymorphism from assembly errors and paralogous sequence variation.

Results & discussion

Both the strengths and the weaknesses of flo largely reflect those of the underlying tools - the chain file and UCSC liftOver. In general, gaps and errors in assemblies may split a long chain. Gene models that are split across different chains, as well as those that are duplicated in the target assembly, are not lifted.

Tweak flo

If you would like to optimise how chain files are created:

- The UCSC wiki and website are an amazing resource to learn about BLAT and chain files. Don't forget to read the Kent et al. 2003 paper cited above first.
- Read the Rakefile from top to bottom. Ruby is similar to, yet simpler than, Perl and bash.

You can test things by lifting annotations between the same assembly.
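When an assembly is lifted onto itself, one would expect the lifted features to keep their original coordinates, so any difference points at a problem in the setup. Here is a minimal sketch of such a comparison, not part of flo. It compares features by sequence id, type, start, end and strand. The helper names and the toy GFF text are made up.

```ruby
require 'set'

# Set of [seqid, type, start, end, strand] for every feature line.
def feature_keys(gff_text)
  keys = Set.new
  gff_text.each_line do |line|
    next if line.start_with?('#') || line.strip.empty?

    cols = line.chomp.split("\t")
    next if cols.size < 9

    keys << [cols[0], cols[2], cols[3].to_i, cols[4].to_i, cols[6]]
  end
  keys
end

# Source features that do not appear unchanged in the lifted annotation.
def changed_features(source_gff, lifted_gff)
  (feature_keys(source_gff) - feature_keys(lifted_gff)).to_a.sort
end

toy_source = "chr1\ttoy\tmRNA\t100\t500\t.\t+\t.\tID=t1\n" \
             "chr1\ttoy\texon\t100\t200\t.\t+\t.\tParent=t1\n"
toy_lifted = "chr1\ttoy\tmRNA\t100\t500\t.\t+\t.\tID=t1\n" \
             "chr1\ttoy\texon\t101\t200\t.\t+\t.\tParent=t1\n"

p changed_features(toy_source, toy_lifted)
```

Features missing from the lifted file are reported as well, so the result also shows anything that was not lifted.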

Owner

Name: Yannick Wurm research lab @ QMUL & Turing
Login: wurmlab
Kind: organization
Location: Queen Mary University of London

Website: https://wurmlab.com
Twitter: yannick__
Repositories: 34
Profile: https://github.com/wurmlab

Ants, bees, evolutionary genomics & bioinformatics.

Citation (CITATION.cff)

# YAML 1.2
---
authors: 
  -
    family-names: Pracana
    given-names: Rodrigo
  -
    family-names: Priyam
    given-names: Anurag
  -
    family-names: Levantis
    given-names: Ilya
  -
    family-names: Nichols
    given-names: "Richard A"
  -
    family-names: Wurm
    given-names: Yannick
cff-version: "1.1.0"
date-released: 2017-02-21
doi: "10.1111/mec.14054"
keywords: 
  - "annotation lift over"
  - pipeline
message: "We created flo for our work on the fire ant genome. If you use flo, please cite it using these metadata."
repository-code: "https://github.com/wurmlab/flo"
title: "The fire ant social chromosome supergene variant Sb shows low diversity but high divergence from SB"
...

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Committers

Last synced: 7 months ago

All Time

Total Commits: 48
Total Committers: 5
Avg Commits per committer: 9.6
Development Distribution Score (DDS): 0.083

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Anurag Priyam	a**m@q**k	44
cmatkhan	c**k@g**m	1
Yannick Wurm	y**m@q**k	1
Philipp Bayer	p**y@g**m	1
Npaffen	5****n	1

Committer Domains (Top 20 + Academic)

qmul.ac.uk: 2

Issues and Pull Requests

Last synced: 7 months ago

All Time

Total issues: 41
Total pull requests: 5
Average time to close issues: 8 months
Average time to close pull requests: 1 day
Total issue authors: 27
Total pull request authors: 5
Average comments per issue: 2.76
Average comments per pull request: 0.6
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

splaisan (5)
mictadlo (4)
yeban (3)
olechnwin (3)
14zac2 (2)
photocyte (2)
yannickwurm (2)
El-Castor (1)
pgonzale60 (1)
yplee614 (1)
Npaffen (1)
Huangyizhong (1)
RNAseqer (1)
kbrevs (1)
matoller (1)

Pull Request Authors

philippbayer (1)
Npaffen (1)
photocyte (1)
lassancejm (1)
cmatKhan (1)
