https://github.com/alejandrogzi/noel

GTF/GFF per gene non-overlapping exon length calculator

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.1%) to scientific vocabulary

Keywords

exon gene-annotation gff gtf length non-overlapping
Last synced: 6 months ago · JSON representation

Repository

GTF/GFF per gene non-overlapping exon length calculator

Basic Info
  • Host: GitHub
  • Owner: alejandrogzi
  • License: mit
  • Language: Rust
  • Default Branch: master
  • Homepage:
  • Size: 734 KB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
exon gene-annotation gff gtf length non-overlapping
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
License

https://github.com/alejandrogzi/noel/blob/master/

![version-badge](https://img.shields.io/badge/version-0.2.0-green)
![Crates.io](https://img.shields.io/crates/v/noel)
![GitHub](https://img.shields.io/github/license/alejandrogzi/noel?color=blue)


# noel

An extremely fast GTF/GFF per gene Non-Overlapping Exon Length calculator (noel) written in Rust.
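
To make the reported quantity concrete, here is a minimal sketch of one common way to compute a non-overlapping exon length: sort the exon intervals of a gene, merge the ones that overlap, and sum the merged spans. This is an illustration only and is not taken from noel's source; the function name `non_overlapping_length` and the example coordinates are hypothetical. Coordinates are assumed to be 1-based and inclusive, as in GTF/GFF.

``` rust
/// Number of bases covered by at least one exon, counting overlaps once.
/// Hypothetical helper for illustration; not part of the noel API.
fn non_overlapping_length(exons: &[(u32, u32)]) -> u32 {
    // Sort by start (then end) so overlapping exons become neighbours.
    let mut sorted = exons.to_vec();
    sorted.sort_unstable();

    let mut total = 0;
    let mut current: Option<(u32, u32)> = None;

    for (start, end) in sorted {
        current = match current {
            // Overlaps the open interval: extend it.
            Some((cs, ce)) if start <= ce => Some((cs, ce.max(end))),
            // Disjoint: close the open interval and start a new one.
            Some((cs, ce)) => {
                total += ce - cs + 1;
                Some((start, end))
            }
            None => Some((start, end)),
        };
    }

    // Close the last open interval, if any.
    if let Some((cs, ce)) = current {
        total += ce - cs + 1;
    }
    total
}

fn main() {
    // Exons from two transcripts of a made-up gene; the first two overlap.
    let exons = [(100, 200), (150, 300), (400, 450)];
    // Merged spans are 100-300 (201 bp) and 400-450 (51 bp).
    println!("{}", non_overlapping_length(&exons)); // prints 252
}
```

Summing the raw exon lengths here would give 303 bp, because the 150-200 overlap would be counted twice; merging first avoids that.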

noel takes in a GTF/GFF file and outputs a .txt file with non-overlapping exon lengths.

## Usage

``` rust
Usage: noel --i --o

Arguments:
    --i : GTF/GFF file
    --o : .txt file

Options:
    --help: print help
    --version: print version
```

#### crate: [https://crates.io/crates/noel](https://crates.io/crates/noel)

## Installation

To install noel on your system, follow these steps:

1. download rust: `curl https://sh.rustup.rs -sSf | sh` on unix, or go [here](https://www.rust-lang.org/tools/install) for other options
2. run `cargo install noel` (make sure `~/.cargo/bin` is in your `$PATH` before running it)
3. use `noel` with the required arguments

## Build

To build noel from this repo:

1. get rust (as described above)
2. run `git clone https://github.com/alejandrogzi/noel.git && cd noel`
3. run `cargo run --release ` (arguments are positional, so you do not need to specify --i/--o)

## Library

To include noel as a library and use it within your project, follow these steps:

1. include `noel = "0.2.0"` or `noel = "*"` under `[dependencies]` in your `Cargo.toml` file, or just run `cargo add noel` from the command line
2. the library name is `noel`; to use it, just write:

``` rust
use noel::{noel, noel_reader};
```

or

``` rust
use noel::*;
```

3. invoke

``` rust
let exons: HashMap> = noel_reader(input: &PathBuf)?
let lengths: Vec<(String, u32)> = noel(exons)
```

4. you will end up with a `Vec` of pairs, where each gene name (gene_id) is matched to its length

```text
[("ENSG00000261469": 533), ("ENSG00000150990": 6908), ("ENSG00000136490": 4751), ("ENSG00000290760": 801)]
```

## Benchmark

There are a handful of open-source tools and scripts to calculate non-overlapping exon lengths, namely the Kooi [1], Sun [2], and Slowikowski [3, 4] scripts, and gtftools (-l flag) [5]. The Non-Overlapping Exon Length calculator (NOEL; referred to simply as "noel") is introduced as a new tool that outperforms the aforementioned software in both run time and memory usage.

To assess the efficiency of noel and test the capabilities of the other available scripts and tools, I used run time and memory usage estimates based on 5 consecutive runs. This evaluation focused on two major gene annotation formats: GTF and GFF. It is worth noting, however, that only 3 tools are capable of handling GFF files: Slowikowski, Sun* (described below) and noel. Before any batch of runs, I first modified each script to be CLI-responsive. Additionally, I edited Sun's script so that it could handle GFF inputs by changing a regex pattern. No performance-enhancing changes or breaking structural modifications were applied. Lastly, to evaluate the output consistency of the top-ranked tools (Sun, gtftools and noel), three species were used: *Homo sapiens* (GRCh38, GENCODE 44), *Canis lupus familiaris* (ROS_Cfam_1.0, Ensembl 110), and *Mus musculus* (GRCm39, GENCODE M33).

The different methodologies used to calculate non-overlapping exon lengths led to noticeable differences in run times. The Kooi and Slowikowski scripts ranked last with GTF files (>250 s for GENCODE 44), and Slowikowski ranked last with GFF files (~300 s for GENCODE 44), while Sun, gtftools and noel were the most efficient options (<50 s for GENCODE 44). Among these top-ranked tools, noel is clearly the fastest. For GTF files, noel is 4.3x faster than gtftools (4.2 s vs 17.9 s) and 10.9x faster than Sun's script (4.2 s vs 45.7 s). For GFF3 files, noel performs the calculations 12.6x faster than Sun's script (3.9 s vs 49.7 s).

A similar pattern is seen when examining memory usage estimates based on GTF files. Three distinct groups of tools can be identified: high-memory-consuming tools (Sun, Slowikowski, and Kooi), tools with moderate memory usage (gtftools), and the most memory-efficient option (noel). Here, noel used far less memory than gtftools (9.1x less; 42.9 MB vs 391.8 MB) and Kooi (73.1x less; 42.9 MB vs 3.1 GB). With GFF files, noel achieved a 146.1-fold reduction in memory usage compared to Slowikowski (62,700 genes).

The comparison of output from the top-ranked tools (Sun, gtftools, and noel) yielded consistently paired estimates for each species, resulting in a high correlation (R = 0.99). Notably, noel and Sun's script showed a one-to-one correspondence for every gene in all tested annotation models. In contrast, gtftools did not process every gene: it missed a small fraction of genes in the human and mouse models (0.05% and 0.06%, respectively) and a much larger fraction in the dog model (26%). Furthermore, noel outperformed the other tools in run time in both the mouse and dog models, with a speedup of at least 2.3 times.

Based on this comparative analysis between existing scripts/software to calculate non-overlapping exonic lengths and noel, noel represents a significant improvement. These findings show the potential of noel as a fast and efficient way to automate non-overlapping exon length calculations.

## References

[1] https://www.biostars.org/p/83901/

[2] https://gist.github.com/jsun/aeca04ee2c5b5cc53ad795b660edd6c3

[3] https://gist.github.com/slowkow/8101481

[4] https://gist.github.com/slowkow/8101509#file-coding_lengths-py

[5] Hong-Dong Li, Cui-Xiang Lin, Jiantao Zheng, GTFtools: a software package for analyzing various features of gene models, Bioinformatics, Volume 38, Issue 20, 15 October 2022, Pages 4806-4808, https://doi.org/10.1093/bioinformatics/btac561

Owner

  • Name: Alejandro Gonzales-Irribarren
  • Login: alejandrogzi
  • Kind: user


  • Total packages: 1
  • Total downloads:
    • cargo 3,069 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
crates.io: noel

A GTF/GFF per gene non-overlapping exon length calculator

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 3,069 Total
Rankings
Dependent repos count: 30.7%
Dependent packages count: 36.2%
Average: 55.1%
Downloads: 98.4%
Maintainers (1)
Last synced: 6 months ago