https://github.com/danforthcenter/falcon2fastg

Falcon2Fastg is a tool for converting a FALCON assembly to FASTG format to visualize with Bandage

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 7 committers (28.6%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: danforthcenter
  • License: MIT
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 2.4 MB
Statistics
  • Stars: 0
  • Watchers: 4
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of md5sam/Falcon2Fastg
Created over 8 years ago · Last pushed over 8 years ago
Metadata Files
Readme License

README.md

Falcon2Fastg

This software converts the results of a PacBio assembly produced by FALCON into a FASTG graph that can be visualized with Bandage.

Usage

python Falcon2Fastg.py [--only-output=reads|contigs]

Run this in the output directory of the FALCON assembly (2-asm-falcon). Before running, copy the preads4falcon.fasta file from the intermediate directory (1-preads_ovl) to the output directory (2-asm-falcon).
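The copy step can be sketched in Python as follows. The helper name stage_preads and the idea of passing a single FALCON run directory are illustrative assumptions; only the directory and file names come from the text above.

```python
import shutil
from pathlib import Path


def stage_preads(falcon_root):
    """Copy preads4falcon.fasta from 1-preads_ovl into 2-asm-falcon.

    Hypothetical helper: the directory names are taken from this README,
    and a given FALCON version may lay its files out differently.
    Returns the destination path.
    """
    root = Path(falcon_root)
    src = root / "1-preads_ovl" / "preads4falcon.fasta"
    dst = root / "2-asm-falcon" / "preads4falcon.fasta"
    if not src.is_file():
        raise FileNotFoundError(f"expected preads file at {src}")
    if not dst.exists():
        # Leave an existing copy untouched rather than overwrite it.
        shutil.copy2(src, dst)
    return dst
```

Calling stage_preads("/path/to/falcon_run") once before running Falcon2Fastg.py would be enough; the path here is a placeholder.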

Falcon2Fastg needs the following 6 input files:

  • preads4falcon.fasta

  • sg_edges_list

  • utg_data (if --only-output is unset, or set to contigs)

  • ctg_paths (if --only-output is unset, or set to contigs)

  • p_ctg.fa (if --only-output is unset, or set to contigs)

  • p_ctg_tiling_path (if --only-output is unset, or set to contigs)
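A pre-flight check for these files might look like the sketch below. It is not part of Falcon2Fastg; the function names are hypothetical, and the file names are written with underscores as FALCON emits them in 2-asm-falcon.

```python
from pathlib import Path

# File names as FALCON writes them (assumed to sit in one directory).
ALWAYS_REQUIRED = ["preads4falcon.fasta", "sg_edges_list"]
CONTIG_ONLY = ["utg_data", "ctg_paths", "p_ctg.fa", "p_ctg_tiling_path"]


def required_inputs(only_output=None):
    """List the input files needed for a given --only-output value."""
    if only_output not in (None, "reads", "contigs"):
        raise ValueError("only_output must be None, 'reads' or 'contigs'")
    if only_output == "reads":
        return list(ALWAYS_REQUIRED)
    # Unset or 'contigs': the contig-related files are needed as well.
    return ALWAYS_REQUIRED + CONTIG_ONLY


def missing_inputs(directory, only_output=None):
    """Return the required files that are absent from `directory`."""
    base = Path(directory)
    return [name for name in required_inputs(only_output)
            if not (base / name).is_file()]
```

An empty list from missing_inputs(".") means every file listed above is present in the current directory.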

Dependencies :

Biopython (available at http://biopython.org/wiki/Download)

pyfaidx (available at https://github.com/mdshw5/pyfaidx)

Quick installation of dependencies:

pip install biopython pyfaidx  # add --user if you don't have root
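To confirm that both dependencies are importable before a long run, a small standard-library check can help. This is a sketch with a hypothetical function name; it relies on Biopython being imported as Bio and pyfaidx as pyfaidx.

```python
import importlib.util


def missing_modules(names=("Bio", "pyfaidx")):
    """Return the top-level module names that cannot be found.

    "Bio" is the import name of Biopython; "pyfaidx" is imported
    under its own name.
    """
    return [name for name in names
            if importlib.util.find_spec(name) is None]
```

An empty list means both packages were found in the current Python environment.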

Output :

The tool outputs two FASTG files (reads.fastg and contigs.fastg) that can be opened with Bandage.

Additionally, the tool produces a CSV file, ReadsInContigs.csv, that can be loaded into Bandage. It labels each read with the contig it belongs to, along with the read's mapping position within that contig.

[Figure: Bandage visualization of reads.fastg]

Above is a sample Bandage visualization of a reads.fastg file generated by Falcon2Fastg from a FALCON assembly (a plant mitochondrial genome).

  • Each node is a read, and each node is represented as a colored strip (colors are random)
  • Edges represent the overlaps between reads found by FALCON (better viewed in the zoomed-in image below)
  • Only the edges used in the string graph ("G" flagged in sg_edges_list) are used by Falcon2Fastg to produce the output file.
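The "G" filter described above can be illustrated with a short sketch. It assumes that each row of the edge list is whitespace-separated, that the first two fields name the source and target nodes, and that the flag is the last field. That layout is an assumption made for this example, not something stated in this README, and the function is not Falcon2Fastg's own code.

```python
def string_graph_edges(lines):
    """Yield (source, target) pairs for rows flagged "G".

    Assumed row layout: source node, target node, further columns,
    with the edge flag as the final whitespace-separated field.
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[-1] == "G":
            yield fields[0], fields[1]


# Invented rows, for illustration only.
rows = [
    "000000001:B 000000002:E 000000002 10 0 500 99.5 G",
    "000000003:B 000000004:E 000000004 20 0 400 99.1 TR",
]
kept = list(string_graph_edges(rows))
```

In this example only the first row is kept, because the second row carries a flag other than "G".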

Zooming in on a smaller set of nodes shows the edges in black, connecting the colored nodes :

[Figure: zoomed-in view showing edges between nodes]

For benchmarking, Falcon2Fastg was run on the preads4falcon.fasta and sg_edges_list files produced from the E. coli test dataset provided with the Falcon install. Instructions for obtaining the dataset are here: https://github.com/PacificBiosciences/FALCON/wiki/Setup:-Complete-example

Execution of Falcon2Fastg took 2 minutes on a desktop computer (size of preads4falcon.fasta: 449 MB).

The figure below represents a visualization of this E. coli data.

[Figure: Bandage visualization of the E. coli data]

Contigs visualization

Falcon2Fastg can also be used to visualize the contigs produced by FALCON and the overlaps between them. The contig graph is written to contigs.fastg, which Falcon2Fastg produces by default. To output only the reads graph, use the --only-output=reads parameter.

To test this visualization mode, we assembled Drosophila melanogaster reads available at:
https://github.com/PacificBiosciences/DevNet/wiki/Drosophila-sequence-and-assembly

The input file was 2.2 GB in size (dmel_FALCON_preassembled_reads.fasta).

FALCON assembly parameters were not optimized, and were as follows :

length_cutoff = 3000, length_cutoff_pr = 6000, overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20

The final p_ctg.fa file had 642 contigs with a total length of ~27 Mbp.
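Contig count and total length can be recomputed from any FASTA file with a few lines of plain Python. The function below is an illustrative sketch, not part of Falcon2Fastg, and it assumes a well-formed FASTA in which header lines start with ">".

```python
def fasta_stats(lines):
    """Return (number of records, total sequence length) for FASTA lines."""
    n_records = 0
    total_length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_records += 1
        else:
            total_length += len(line)
    return n_records, total_length
```

Passing an open file handle works as well, for example `with open("p_ctg.fa") as fh: print(fasta_stats(fh))`.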

Execution of Falcon2Fastg took 5 minutes on a desktop computer (size of preads4falcon.fasta: 2.2 GB).

The figure below is the visualization of these D. melanogaster contigs (colors are random).

[Figure: Bandage visualization of the D. melanogaster contigs]

Read density (approximate read coverage)

Bandage provides a way to visualize k-mer coverage, as reported by the assembler. As Falcon is a string graph assembler, it does not report such information. Ideally, to compute the coverage of a contig, one would re-map the reads back to the assembled contigs. Here, we report a simpler metric that is easy to compute from the output of Falcon.

Read density is calculated as (sum of the lengths of all reads used by FALCON to construct the contig) / (length of the contig). We believe that variation in read density reflects variation in coverage.
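As a worked illustration of this ratio, the sketch below computes read density from a list of read lengths and a contig length. The function name and the example numbers are invented for illustration; Falcon2Fastg's own implementation may differ in detail.

```python
def read_density(read_lengths, contig_length):
    """Sum of the lengths of the contig's reads divided by contig length."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return sum(read_lengths) / contig_length


# Invented example: two 5,000 bp reads in a 5,000 bp contig.
low = read_density([5000, 5000], 5000)
# Invented example: five 4,000 bp reads in a 4,000 bp contig.
high = read_density([4000] * 5, 4000)
```

Here low evaluates to 2.0 and high to 5.0.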

The figure below is a schematic of read density. The blue arrows represent the reads that Falcon used to build each contig. The upper contig (black) has fewer reads within it; its read density is around 2.0. The lower contig (red) has more reads within it; its read density is around 5.0.

[Figure: schematic of read density]

The figure below is the visualization of the same D. melanogaster contigs, colored by read density.

[Figure: D. melanogaster contigs colored by read density]

Zooming in shows that bright red represents a higher read density (6.0x), while contigs colored black have a lower read density (2.0x).

[Figure: zoomed-in view of contigs colored by read density]

Memory Warning

The pyfaidx module is used to read an entire FASTA file into memory. If the size of your preads4falcon.fasta is greater than the amount of available RAM, it is advisable to run this computation on a server with greater available memory.
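A rough pre-check along these lines is sketched below. It compares the FASTA file size with the machine's total physical memory, which is only an upper bound on what is actually available. The function names are hypothetical, and the os.sysconf lookup is an assumption that holds on Linux and similar systems but not on every platform.

```python
import os


def total_ram_bytes():
    """Total physical memory in bytes, or None if it cannot be queried."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are unavailable on this platform.
        return None


def fasta_exceeds_ram(fasta_path, ram_bytes=None):
    """True if the file is larger than ram_bytes, None if RAM is unknown.

    ram_bytes defaults to total physical memory, so a False result does
    not guarantee that enough memory is free at run time.
    """
    if ram_bytes is None:
        ram_bytes = total_ram_bytes()
    if ram_bytes is None:
        return None
    return os.path.getsize(fasta_path) > ram_bytes
```

If fasta_exceeds_ram("preads4falcon.fasta") returns True, moving the run to a machine with more memory is the safer option.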

Caveats :

  • Reads within "contained" unitigs are not used in the calculation of Read density.

  • Read density is calculated by dividing the total length of all reads in the contig by the length of the contig (obtained from ctg_paths). Depending on the orientation, Falcon ignores either the first read or the last read while reporting a contig. As a result, the forward and revcomp entries in the contigs.fastg file might have different read_densities and different lengths.

Any large differences are mostly restricted to short contigs, where one very long read at either extremity can affect the length of the contig.

  • Read density is set to "1" for entries in reads.fastg, as this measure is only relevant for contigs.fastg

Testing :

Please see the test/ directory for a small example dataset and its output.

FALCON can be installed following the instructions here : https://github.com/PacificBiosciences/FALCON/wiki/Setup:-Complete-example

Other tools

Additional tools for visualizing read overlap can be found in the utils directory. Please consult utils/README.md for details

License

This content is released under MIT License. Please see LICENSE.md for details.

Authors

Primary author : Samarth Rangavittal, The Pennsylvania State University (szr165@psu.edu)

Rayan Chikhi, University of Lille 1

Jean-Stéphane Varré, University of Lille 1

Owner

  • Name: Donald Danforth Plant Science Center
  • Login: danforthcenter
  • Kind: organization
  • Location: St. Louis, MO

Our Mission: Improve the Human Condition Through Plant Science

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 86
  • Total Committers: 7
  • Avg Commits per committer: 12.286
  • Development Distribution Score (DDS): 0.616
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
samarth s****h@b****r 33
Samarth Rangavittal s****5@b****u 23
Rayan Chikhi r****i@e****g 12
Samarth Rangavittal s****h@S****l 10
Samarth Rangavittal s****l@g****m 4
afinit m****1@g****m 3
Samarth Rangavittal s****h@c****u 1

Issues and Pull Requests

Last synced: over 2 years ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0