GFAKluge

GFAKluge: A C++ library and command line utilities for the Graphical Fragment Assembly formats - Published in JOSS (2019)

https://github.com/edawson/gfakluge

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org, zenodo.org
  • Committers with academic emails
    2 of 13 committers (15.4%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

genomics gfa graph-representation parsing

Keywords from Contributors

bioinformatics
Last synced: 6 months ago · JSON representation

Repository

A C++ library and utilities for manipulating the Graphical Fragment Assembly format.

Basic Info
Statistics
  • Stars: 55
  • Watchers: 3
  • Forks: 20
  • Open Issues: 22
  • Releases: 10
Topics
genomics gfa graph-representation parsing
Created about 10 years ago · Last pushed almost 4 years ago
Metadata Files
Readme License Codemeta

README.md

gfakluge

What is it?

GFAKluge is a C++ parser/writer and a set of command line utilities for manipulating GFA files. It parses GFA to a set of data structures that represent the encoded graph. You can use these components and their fields/members to build up your own graph representation. You can also convert between GFA 0.1 <-> 1.0 <-> 2.0 to glue programs that use different GFA versions together.

Homepage: https://github.com/edawson/gfakluge
License: MIT

Dependencies

A C++11 compliant compiler (we recommend GCC or clang)
OpenMP (via GCC or clang)
NB: GFAKluge cannot be compiled with Apple clang, as it does not include OpenMP.

Command line utilities

When make is run, the gfak binary is built in the top level directory. It offers the following subcommands:
+ gfak extract : transform the GFA segment lines to a FASTA file.
+ gfak fillseq : fill in the sequence field of S lines with placeholders using sequences from a FASTA file.
+ gfak diff : check if two GFA files are different (not very sophisticated at the moment)
+ gfak sort : change the line order of a GFA file so that lines proceed in Header -> Segment -> Link/Edge/Containment -> Path order.
+ gfak convert : convert between the different GFA specifications (e.g. GFA1 -> GFA2).
+ gfak stats : get the assembly stats of a GFA file (e.g. N50, L50)
+ gfak subset : extract a subgraph between two Segment IDs in a GFA file.
+ gfak ids : manually coordinate / increment the ID spaces of two graphs, so that they can be concatenated.
+ gfak merge : merge (i.e. concatenate) multiple GFA files. NB: Obliterates nodes with the same ID.
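
The line order described for gfak sort can be illustrated with a stable sort keyed on record type. This is a minimal sketch that does not use GFAKluge itself: record_rank and sort_gfa_lines are hypothetical names, the record type is assumed to be the first character of each line, and placing unlisted record types last is an assumption rather than documented gfak behavior.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: rank a GFA line by its record type (first character).
// Unlisted record types are ranked last; that choice is an assumption of this sketch.
int record_rank(const std::string& line) {
    if (line.empty()) return 4;
    switch (line[0]) {
        case 'H': return 0;                      // header
        case 'S': return 1;                      // segment
        case 'L': case 'E': case 'C': return 2;  // link / edge / containment
        case 'P': return 3;                      // path
        default:  return 4;
    }
}

// Reorder lines by rank, keeping the input order within each rank.
std::vector<std::string> sort_gfa_lines(std::vector<std::string> lines) {
    std::stable_sort(lines.begin(), lines.end(),
                     [](const std::string& a, const std::string& b) {
                         return record_rank(a) < record_rank(b);
                     });
    return lines;
}
```

Using std::stable_sort rather than std::sort keeps segments (and links) in the order they appeared in the input, which makes the result easier to compare against the original file.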

For CLI usage, run any of the above (including gfak with no subcommand) with no arguments or -h. To change the specification version, most commands take the -S flag followed by a single numeric argument, which is read as a double.
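
For readers unfamiliar with the assembly statistics mentioned above, the following sketch computes N50 and L50 from a list of segment lengths using the standard definitions: N50 is the length of the segment at which the running sum of lengths, taken from longest to shortest, first reaches half of the total, and L50 is the number of segments needed to get there. The function name n50_l50 is hypothetical, and this is not necessarily how gfak stats is implemented.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Returns {N50, L50} for a set of segment lengths; {0, 0} if the input is empty.
std::pair<std::uint64_t, std::size_t> n50_l50(std::vector<std::uint64_t> lengths) {
    // Longest segments first.
    std::sort(lengths.begin(), lengths.end(), std::greater<std::uint64_t>());

    std::uint64_t total = 0;
    for (std::uint64_t len : lengths) total += len;

    std::uint64_t running = 0;
    for (std::size_t i = 0; i < lengths.size(); ++i) {
        running += lengths[i];
        // Stop at the first segment that brings the running sum to half the total.
        if (2 * running >= total) return {lengths[i], i + 1};
    }
    return {0, 0};
}
```

For segment lengths 80, 70, 50, 40, 30 and 20 (total 290), the two longest segments sum to 150, which is at least half of the total, so N50 is 70 and L50 is 2.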

Example CLI Usage

Examples of various commands are included in the examples.md file.

C++ API

Examples of the C++ API are included in the interface.md file.

How do I build it?

The gfak utilities are available via homebrew: brew install brewsci/bio/gfakluge

Building GFAKluge from source requires OpenMP. This should be supported on Linux by default. On Apple Mac OS X, we recommend installing gcc:

brew install gcc@8
make CXX=g++-8
or
sudo port install gcc8
make

You can then build libgfakluge and the command line gfak utilities by typing make in the repo.
To use GFAKluge in your program, you'll need to add a few lines to your code. First, add the necessary include line to your C++ code:
#include "gfakluge.hpp"

Next, make sure that the library is on the proper system paths and compile line:

            g++ -o my_exe my_exe.cpp -L/path/to/gfakluge/ -lgfakluge

You should then be able to parse and manipulate gfa from your program:

                // my_gfa_file holds the path to a GFA file.
                GFAKluge gg;
                gg.parse_gfa_file(my_gfa_file);

                cout << gg << endl;

Why gfak / gfakluge?

  • Simple command line utilities (no awk foo needed!)
  • High level C++ API for many graph manipulations.
  • Easy to build - no external dependencies; build with just a modern C++ compiler supporting C++11.
  • Easy to develop with - Backing library is mostly STL containers and a handful of structs.
  • Performance - gfakluge is fast and relies on standard STL containers and basic structs.

Internal Structures

Internally, lines of GFA are represented as structs with member variables that correspond to their defined fields. Here's the definition for a sequence line, for example:

            struct sequence_elem{
                std::string sequence;
                std::string name;
                std::map<std::string, std::string> opt_fields;
                long id;
            };

The structs for contained elements, link elements, and alignment elements are very similar. These individual structs are then wrapped in a set of standard containers for easy access:

            std::map<std::string, std::string> header;
            std::map<std::string, sequence_elem> name_to_seq;
            std::map<std::string, std::vector<contained_elem> > seq_to_contained;
            std::map<std::string, std::vector<link_elem> > seq_to_link;
            std::map<std::string, std::vector<alignment_elem> > seq_to_alignment;

All of these structures can be accessed using the get_<Thing> method, where <Thing> is the name of the map you would like to retrieve. They reside in gfakluge.hpp.
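
To make the relationship between a GFA line, its struct, and its container concrete, the sketch below parses a GFA1 S line into a stand-in struct that can then be stored in a name-keyed map. It deliberately does not include gfakluge.hpp: segment_record, split_tabs, and parse_segment_line are hypothetical names, and storing each optional field as tag -> "TYPE:VALUE" is an assumption of this sketch, not the library's layout.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the README's sequence_elem; not the library's definition.
struct segment_record {
    std::string name;
    std::string sequence;
    std::map<std::string, std::string> opt_fields;  // tag -> "TYPE:VALUE"
};

// Split a line on tab characters.
std::vector<std::string> split_tabs(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '\t')) fields.push_back(field);
    return fields;
}

// Parse a GFA1 segment line: S <name> <sequence> [TAG:TYPE:VALUE ...].
// Returns false (leaving out untouched) if the line is not an S line.
bool parse_segment_line(const std::string& line, segment_record& out) {
    std::vector<std::string> f = split_tabs(line);
    if (f.size() < 3 || f[0] != "S") return false;
    out.name = f[1];
    out.sequence = f[2];
    out.opt_fields.clear();
    for (std::size_t i = 3; i < f.size(); ++i) {
        // Optional fields look like LN:i:7; key on the two-letter tag.
        if (f[i].size() >= 3 && f[i][2] == ':') {
            out.opt_fields[f[i].substr(0, 2)] = f[i].substr(3);
        }
    }
    return true;
}
```

A parsed record can be placed in a std::map<std::string, segment_record> keyed on its name, which mirrors the role name_to_seq plays above.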

GFA2

GFAKluge now supports GFA2! This brings with it four new structs: edge_elem, gap_elem, fragment_elem, and group_elem. They're contained in maps much like those for the GFA1 structs.

A few caveats apply:
1. As GFA2 is a superset of GFA1, we only support legal GFA2 -> GFA1 conversions. Information can be lost along the way (e.g. unordered groups won't be output).
2. Our GFA2 testing is a bit limited, but we have checked the output against the spec several times.

Tags we specifically do not (i.e. cannot) support in GFA2 -> GFA1 conversion: G - gap, U - unordered group, F - fragment. Links and containments should get converted to edges correctly. Sequence elements should get converted, but watch out for the length field if you hit issues.
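
Because G, U, and F records have no GFA1 counterpart, it can be useful to check which lines of a GFA2 file would be lost before converting. The sketch below is a standalone illustration and not part of GFAKluge: is_dropped_in_gfa1 and lines_lost_in_gfa1 are hypothetical names, and the record type is assumed to be the first character of each line.

```cpp
#include <string>
#include <vector>

// True for GFA2 record types the text lists as not convertible to GFA1:
// G (gap), U (unordered group), F (fragment).
bool is_dropped_in_gfa1(const std::string& line) {
    if (line.empty()) return false;
    return line[0] == 'G' || line[0] == 'U' || line[0] == 'F';
}

// Collect the lines that would have no GFA1 counterpart, in input order.
std::vector<std::string> lines_lost_in_gfa1(const std::vector<std::string>& gfa2_lines) {
    std::vector<std::string> lost;
    for (const std::string& line : gfa2_lines) {
        if (is_dropped_in_gfa1(line)) lost.push_back(line);
    }
    return lost;
}
```

An empty result means none of the three listed record types is present; it says nothing about other lossy cases such as the segment length field.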

GFAKluge is fully compliant with reading GFA2 and GFA0.1 <-> GFA1.0 -> GFA2.0 conversion as of September 2017.

Reading GFA

            GFAKluge gg;
            gg.parse_gfa_file("my_gfa.gfa");

You can then iterate over the aforementioned maps/structs and build out your own graph representation.
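
As one possible shape for "your own graph representation", the sketch below turns a map of links into an oriented adjacency list. It uses stand-in types so that it compiles without gfakluge.hpp: link_record copies only the link fields used in the writing example in this README, and link_map, adjacency, and build_adjacency are hypothetical names.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for the README's link_elem, reduced to the fields used here.
struct link_record {
    std::string source;
    std::string sink;
    bool source_orientation_forward;
    bool sink_orientation_forward;
};

using link_map = std::map<std::string, std::vector<link_record>>;
using adjacency = std::map<std::string, std::vector<std::string>>;

// Build an oriented adjacency list such as "seq1+" -> {"seq2-"}.
adjacency build_adjacency(const link_map& seq_to_link) {
    adjacency adj;
    for (const auto& entry : seq_to_link) {
        for (const link_record& l : entry.second) {
            std::string from = l.source + (l.source_orientation_forward ? "+" : "-");
            std::string to = l.sink + (l.sink_orientation_forward ? "+" : "-");
            adj[from].push_back(to);
        }
    }
    return adj;
}
```

Keying nodes on name plus orientation keeps the two strands of a segment distinct; a representation that ignores orientation could key on the name alone.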

I'm working on a low-memory API for reading lines / emitting structs but it won't be this pretty.

Writing GFA

            GFAKluge og;

            sequence_elem s;
            s.sequence = "GATTACA";
            s.name = "seq1";
            og.add_sequence(s);

            sequence_elem t;
            t.sequence = "AATTGN";
            t.name = "seq2";
            og.add_sequence(t);

            link_elem l;
            l.source = s.name;
            l.sink = t.name;
            l.source_orientation_forward = true;
            l.sink_orientation_forward = true;
            l.pos = 0;
            l.cigar = "";

            og.add_link(l.source, l);

            cout << og << endl;
            ofstream f("my_file.gfa");
            // Write GFA1
            f << og;

            // To convert to GFA2:
            og.set_version(2.0);
            f << og;

Status

  • GFAKluge is essentially a set of dumb containers - it does no error checking of your structure to detect if it is valid GFA. This may change as the GFA spec becomes more formal.
  • Diff is not a useful tool yet.
  • Parses JSON structs in optional fields of sequence lines (just as strings though).
  • Full GFA1/GFA2 compatibility and interconversion is now implemented.
  • CLI has been refactored to a single executable
  • Memory usage for to_string is a bit high - be careful with large graphs.
  • API for input / spec conversion / output is stable. API for merging graphs and coordinating ID namespaces may change slightly, but will strive for backwards compatibility.

Getting Help

Eric T Dawson
github: edawson
Please post an issue for help.

Contributing

GFAKluge is open-source and community contributions are welcome and appreciated! Please keep the following in mind when contributing to the repo:

  1. Please treat others with kindness and professionalism. Everyone is welcome and we will not tolerate harassment for any reason.
  2. Please keep gfakluge.hpp header-only and update the build process if a modification alters it.
  3. Please update the dependency list if one is added.
  4. Please use semantic versioning. Minor changes bump the third versioning digit (e.g. 1.0.0 -> 1.0.1).
    Additional features, or changes that may or may not partially break backward compatibility but which do not require significant modifications to code depending on the library bump the second versioning digit (e.g. 1.0.0 -> 1.1.0).
    Changes which significantly alter the API require a bump in the major version digit (e.g. 1.0.0 -> 2.0.0).
  5. Please fully specify all namespace items (e.g. std::stream in place of just stream).
  6. To incorporate changes, please file a pull request on the GitHub page.
  7. Bug reports or feature requests should be posted as "issues" on the GitHub page with the appropriate tag and referenced in any relevant pull requests.

Owner

  • Name: Eric T. Dawson
  • Login: edawson
  • Kind: user
  • Location: Working From Home
  • Company: @nvidia

Bioinformatics Scientist (Genomics / AI) @nvidia

JOSS Publication

GFAKluge: A C++ library and command line utilities for the Graphical Fragment Assembly formats
Published
January 22, 2019
Volume 4, Issue 33, Page 1083
Authors
Eric T. Dawson ORCID
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD, USA, Department of Genetics, University of Cambridge, Cambridge, UK, Wellcome Sanger Institute, Hinxton, UK
Richard Durbin ORCID
Department of Genetics, University of Cambridge, Cambridge, UK, Wellcome Sanger Institute, Hinxton, UK
Editor
Melissa Gymrek ORCID
Tags
GFA genome assembly bioinformatics

CodeMeta (codemeta.json)

{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@type": "Code",
  "author": [
    {
      "@id": "0000-0001-5448-1653",
      "@type": "Person",
      "email": "eric.t.dawson@gmail.com",
      "name": "Eric T. Dawson",
      "affiliation": "Department of Genetics, University of Cambridge; Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health; Wellcome Sanger Institute"
    },
    {
      "@id": "0000-0002-9130-1006",
      "@type": "Person",
      "email": "rd109@cam.ac.uk",
      "name": "Richard Durbin",
      "affiliation": "Department of Genetics, University of Cambridge; Wellcome Sanger Institute"
    }
  ],
  "identifier": "doi.org/10.5281/zenodo.1434136",
  "codeRepository": "https://github.com/edawson/gfakluge",
  "datePublished": "2018-09-24",
  "dateModified": "2018-09-24",
  "dateCreated": "2018-09-24",
  "description": "A C++ library and utilities for manipulating the Graphical Fragment Assembly format.",
  "keywords": "genome assembly, GFA, C++, bioinformatics, FASTA",
  "license": "MIT",
  "title": "GFAKluge: A C++ library and command line utilities for the Graphical Fragment Assembly formats",
  "version": "1.0.0"
}

GitHub Events

Total
  • Watch event: 4
Last Year
  • Watch event: 4

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 350
  • Total Committers: 13
  • Avg Commits per committer: 26.923
  • Development Distribution Score (DDS): 0.109
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Eric T. Dawson e****n@g****m 312
Eric T Dawson e****n@u****u 9
Camille Scott c****w@g****m 5
Anton Korobeynikov a****n@k****o 5
Ali Ghaffaari a****i@g****m 4
LukasW94 l****t@u****e 3
Erik Garrison e****n@g****m 3
Adam Novak a****k@s****u 3
Shaun Jackman s****n@g****m 2
Michael R. Crusoe 1****c 1
Marco van Zwetselaar io@z****t 1
Hassan Nikaein n****n@g****m 1
Dr. K. D. Murray 1****9 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 43
  • Total pull requests: 30
  • Average time to close issues: 2 months
  • Average time to close pull requests: 30 days
  • Total issue authors: 13
  • Total pull request authors: 13
  • Average comments per issue: 2.42
  • Average comments per pull request: 0.77
  • Merged pull requests: 22
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • edawson (19)
  • sjackman (12)
  • LukasW94 (2)
  • cartoonist (1)
  • JuFengWang (1)
  • 8banzhuan (1)
  • adamnovak (1)
  • AndreaGuarracino (1)
  • jeizenga (1)
  • uveyikk (1)
  • tetron (1)
  • Zethson (1)
  • ekg (1)
Pull Request Authors
  • edawson (10)
  • adamnovak (4)
  • cartoonist (3)
  • sjackman (3)
  • jeizenga (2)
  • camillescott (1)
  • mr-c (1)
  • aafshinfard (1)
  • zwets (1)
  • ekg (1)
  • hnikaein (1)
  • kdm9 (1)
  • asl (1)
Top Labels
Issue Labels
  • enhancement (8)
  • Feature Request (3)
  • bug (2)
  • docs (2)
Pull Request Labels