https://github.com/alejandrogzi/bed2gff

cool BED-to-GFF3 converter that runs in parallel

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.5%) to scientific vocabulary

Keywords

bed bioinformatics gene-annotation genome-annotation gff3
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: alejandrogzi
  • License: mit
  • Language: Rust
  • Default Branch: master
  • Homepage:
  • Size: 110 KB
Statistics
  • Stars: 9
  • Watchers: 1
  • Forks: 2
  • Open Issues: 2
  • Releases: 3
Topics
bed bioinformatics gene-annotation genome-annotation gff3
Created over 2 years ago · Last pushed 10 months ago
Metadata Files
Readme License

README.md

bed2gff

A Rust BED-to-GFF3 parallel translator.

translates

`chr7 56766360 56805692 ENST00000581852.25 1000 + 56766360 56805692 0,0,200 3 3,135,81, 0,496,39251,`

into

```
chr7 bed2gff gene 56399404 56805892 . + . ID=ENSG00000166960;gene_id=ENSG00000166960
chr7 bed2gff transcript 56766361 56805692 . + . ID=ENST00000581852.25;Parent=ENSG00000166960;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25
chr7 bed2gff exon 56766361 56766363 . + . ID=exon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1
chr7 bed2gff CDS 56766361 56766363 . + 0 ID=CDS:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1
...
chr7 bed2gff start_codon 56766361 56766363 . + 0 ID=start_codon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1
chr7 bed2gff stop_codon 56805690 56805692 . + 0 ID=stop_codon:ENST00000581852.25.3;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=3
...
```

in a few seconds.
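As a rough illustration of the translation shown here, the sketch below builds the transcript line from the first six BED columns. It is not bed2gff's actual code: `Gff3Record`, `transcript_record`, and the attribute keys (`gene_id`, `transcript_id`) are assumptions chosen to mirror the example output, and the gene ID is passed in by hand instead of being looked up.

```rust
/// One GFF3 record: the nine tab-separated columns of the format.
struct Gff3Record<'a> {
    seqid: &'a str,
    source: &'a str,
    feature: &'a str,
    start: u64, // 1-based, inclusive
    end: u64,   // 1-based, inclusive
    strand: char,
    phase: Option<u8>,
    attributes: Vec<(&'a str, String)>,
}

impl Gff3Record<'_> {
    /// Serialize the record; score is always written as "." in this sketch.
    fn to_line(&self) -> String {
        let phase = self.phase.map_or(".".to_string(), |p| p.to_string());
        let attrs = self
            .attributes
            .iter()
            .map(|(k, v)| format!("{k}={v}"))
            .collect::<Vec<_>>()
            .join(";");
        format!(
            "{}\t{}\t{}\t{}\t{}\t.\t{}\t{}\t{}",
            self.seqid, self.source, self.feature, self.start, self.end, self.strand, phase, attrs
        )
    }
}

/// Build a transcript-level record from a BED line (hypothetical helper).
/// BED starts are 0-based while GFF3 starts are 1-based, hence `chrom_start + 1`.
fn transcript_record<'a>(bed_line: &'a str, gene_id: &str) -> Option<Gff3Record<'a>> {
    let f: Vec<&str> = bed_line.split_whitespace().collect();
    if f.len() < 6 {
        return None; // need chrom, start, end, name, score, strand
    }
    let chrom_start: u64 = f[1].parse().ok()?;
    let chrom_end: u64 = f[2].parse().ok()?;
    Some(Gff3Record {
        seqid: f[0],
        source: "bed2gff",
        feature: "transcript",
        start: chrom_start + 1,
        end: chrom_end,
        strand: f[5].chars().next()?,
        phase: None,
        attributes: vec![
            ("ID", f[3].to_string()),
            ("Parent", gene_id.to_string()),
            ("gene_id", gene_id.to_string()),
            ("transcript_id", f[3].to_string()),
        ],
    })
}

fn main() {
    let bed = "chr7 56766360 56805692 ENST00000581852.25 1000 + 56766360 56805692 0,0,200 3 3,135,81, 0,496,39251,";
    let rec = transcript_record(bed, "ENSG00000166960").expect("valid BED line");
    println!("{}", rec.to_line());
}
```

The only coordinate change is the start shifting by one; the end stays the same because BED ends are exclusive and GFF3 ends are inclusive.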

Converts:

  • Homo sapiens GRCh38 GENCODE 44 (252,835 transcripts) in 4.16 seconds.
  • Mus musculus GRCm39 GENCODE 44 (149,547 transcripts) in 2.15 seconds.
  • Canis lupus familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.30 seconds.
  • Gallus gallus bGalGal1 Ensembl 110 (72,689 transcripts) in 1.51 seconds.

What's new in v0.1.5

  • Adds a --no-gene flag to convert without an isoforms file.
  • Makes -i required unless --no-gene is set.
  • Refactors BedRecord.

Usage

```text
Usage:
    a) bed2gff[EXE] --bed <BED> --isoforms <ISOFORMS> --output <OUTPUT>
    b) bed2gff[EXE] --bed <BED> --output <OUTPUT> --no-gene

Arguments:
    -b, --bed <BED>: a .bed file
    -i, --isoforms <ISOFORMS>: a tab-delimited file
    -o, --output <OUTPUT>: path to output file
    -n, --no-gene: Flag to disable gene_id feature [default: false]

Options:
    --help: print help
    --version: print version
    --threads/-t: number of threads (default: max cpus)
    --gz: compress output .gtf
```
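To give an idea of what a thread option with a "max cpus" default can look like, here is a minimal sketch using only the standard library. It is an assumption-laden illustration, not the tool's implementation (which may well rely on a crate such as rayon): `resolve_threads`, `convert_parallel`, and the placeholder `convert` function are all hypothetical names.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical stand-in for the per-record conversion work:
/// it only rewrites whitespace-separated fields as tab-separated ones.
fn convert(line: &str) -> String {
    line.split_whitespace().collect::<Vec<_>>().join("\t")
}

/// Resolve the worker count: an explicit positive value wins,
/// otherwise fall back to the number of available CPUs (or 1).
fn resolve_threads(requested: Option<usize>) -> usize {
    requested.filter(|&n| n > 0).unwrap_or_else(|| {
        thread::available_parallelism()
            .map(NonZeroUsize::get)
            .unwrap_or(1)
    })
}

/// Convert all lines in parallel while keeping the input order.
fn convert_parallel(lines: &[&str], threads: usize) -> Vec<String> {
    let threads = threads.max(1);
    // Ceiling division, with a floor of 1 because `chunks(0)` panics.
    let chunk_size = ((lines.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        // One scoped worker per chunk; each returns its converted lines.
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(|l| convert(l)).collect::<Vec<_>>()))
            .collect();
        // Joining in spawn order preserves the original line order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}

fn main() {
    let lines = ["chr1 0 10 tx1", "chr1 20 30 tx2", "chr2 5 15 tx3"];
    let threads = resolve_threads(None);
    for out in convert_parallel(&lines, threads) {
        println!("{out}");
    }
}
```

Splitting the input into contiguous chunks keeps the output order deterministic, which matters when the result has to be written to a single file.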

> [!WARNING]
>
> All the transcripts in the .bed file should appear in the isoforms file.

crate: https://crates.io/crates/bed2gff

Detailed formats

bed2gff needs just two files:

1. a tab-delimited .bed file with 3 required and 9 optional fields:

```
chrom   chromStart  chromEnd    name    ...
  |         |          |          |
chr20   50222035    50222038    ENST00000595977 ...
```

see [BED format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) for more information

2. a tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in the .bed file should appear in the isoforms file):

```
> cat isoforms.txt

ENSG00000198888 ENST00000361390
ENSG00000198763 ENST00000361453
ENSG00000198804 ENST00000361624
ENSG00000188868 ENST00000595977
```

you can build a custom file for your preferred species using [Ensembl BioMart](https://www.ensembl.org/biomart/martview).
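Since every BED transcript has to appear in the isoforms file, it can be handy to check that before converting. The sketch below is a standalone illustration, not part of bed2gff: `parse_isoforms` and `missing_transcripts` are hypothetical helpers, and the isoforms file is assumed to hold the gene in the first column and the transcript in the second, as in the example.

```rust
use std::collections::HashMap;

/// Parse isoforms text with one `gene<TAB>transcript` pair per line.
/// Returns a transcript -> gene lookup table; malformed lines are skipped.
fn parse_isoforms(text: &str) -> HashMap<String, String> {
    text.lines()
        .filter_map(|line| {
            // split_whitespace accepts tabs as well as spaces
            let mut cols = line.split_whitespace();
            let gene = cols.next()?;
            let transcript = cols.next()?;
            Some((transcript.to_string(), gene.to_string()))
        })
        .collect()
}

/// Collect BED names (column 4) that have no entry in the isoforms table.
fn missing_transcripts<'a>(bed: &'a str, isoforms: &HashMap<String, String>) -> Vec<&'a str> {
    bed.lines()
        .filter_map(|line| line.split_whitespace().nth(3))
        .filter(|name| !isoforms.contains_key(*name))
        .collect()
}

fn main() {
    let isoforms = parse_isoforms(
        "ENSG00000198888\tENST00000361390\nENSG00000188868\tENST00000595977\n",
    );
    let bed = "chr20\t50222035\t50222038\tENST00000595977\n";
    let missing = missing_transcripts(bed, &isoforms);
    if missing.is_empty() {
        println!("all BED transcripts are covered by the isoforms file");
    } else {
        println!("missing from isoforms file: {missing:?}");
    }
}
```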

Installation

to install bed2gff on your system, follow these steps:

  1. get rust: curl https://sh.rustup.rs -sSf | sh on unix, or see the official Rust installation instructions for other options
  2. run cargo install bed2gff (make sure ~/.cargo/bin is in your $PATH before running it)
  3. use bed2gff with the required arguments
  4. enjoy!

Build

to build bed2gff from this repo, do:

  1. get rust (as described above)
  2. run git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff
  3. run cargo run --release -- -b <BED> -i <ISOFORMS> -o <OUTPUT>

Container image

to build the development container image:

  1. run git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff
  2. initialize docker with start docker or systemctl start docker
  3. build the image with docker image build --tag bed2gff .
  4. run docker run --rm -v "[dir_where_your_bed_is]:/dir" bed2gff -b /dir/<BED> -i /dir/<ISOFORMS> -o /dir/<OUTPUT>

Conda

to use bed2gff through Conda, run one of:

  • conda install bed2gff -c bioconda
  • conda create -n bed2gff -c bioconda bed2gff

Output

bed2gff writes the output to the path you specify, for example the same directory as the .bed file:

```
bed2gff -b annotation.bed -i isoforms.txt -o output.gff3

.
├── ...
├── isoforms.txt
├── annotation.bed
└── output.gff3
```

where `output.gff3` is the result.

FAQ

Why?

Converting formats is a daily practice in bioinformatics. It is even more common when working with gene annotations, as tools differ in their input/output layouts. GTF, GFF, and BED are the most used formats for storing gene-related annotations, and the conversion needs are not well covered by available software.

A considerable portion of genomic tools reduce the software space by accepting GTF/GFF3 files only, directing BED users to translate their files into different formats. While some of these issues have already been covered for GTF files (e.g. bed2gtf), the GFF3 layout lacks stable conversion tools (1, 2).

bed2gff is presented as a straightforward option to convert BED files into ready-to-use GFF3 files, closing that gap.

How?

bed2gff builds on the code base of bed2gtf, which is essentially a reimplementation of UCSC's C binaries merged into one step (bedToGenePred + genePredToGtf). The tool evaluates the position of exons and other features (CDS, start/stop codons, UTRs), preserving reading frames and adjusting the coordinate indexing. The main approach is now a parallel algorithm that significantly reduces computation time.
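To make the exon positioning and index adjustment concrete, the following sketch derives exon intervals from the BED12 block columns. It follows the BED format definition (blockSizes and blockStarts relative to chromStart) and is only an illustration: `exon_intervals` is a hypothetical name, and CDS, codon, and UTR handling are left out.

```rust
/// Turn BED12 block columns into 1-based, inclusive exon intervals.
/// Returns None on unparsable numbers or mismatched list lengths.
fn exon_intervals(
    chrom_start: u64,
    block_sizes: &str,
    block_starts: &str,
) -> Option<Vec<(u64, u64)>> {
    // Parse a comma-separated list such as "3,135,81," into numbers.
    let parse = |s: &str| -> Option<Vec<u64>> {
        s.split(',')
            .filter(|x| !x.is_empty()) // BED lists often end with a trailing comma
            .map(|x| x.parse::<u64>().ok())
            .collect()
    };
    let sizes = parse(block_sizes)?;
    let starts = parse(block_starts)?;
    if sizes.len() != starts.len() {
        return None;
    }
    Some(
        starts
            .iter()
            .zip(&sizes)
            .map(|(offset, size)| {
                let start0 = chrom_start + offset; // 0-based, half-open BED start
                (start0 + 1, start0 + size) // 1-based, inclusive interval
            })
            .collect(),
    )
}

fn main() {
    // Columns taken from the example BED line: chromStart, blockSizes, blockStarts.
    let exons = exon_intervals(56766360, "3,135,81,", "0,496,39251,").expect("valid blocks");
    for (i, (start, end)) in exons.iter().enumerate() {
        println!("exon {}: {}-{}", i + 1, start, end);
    }
}
```

For the example transcript, the first interval is 56766361-56766363 and the last one ends at 56805692, which agrees with the exon and stop codon coordinates in the sample output.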

Following the rationale of bed2gtf, bed2gff produces a ready-to-use GFF3 file by using an isoforms file, which works like the refTable in the C binaries to map each transcript to its gene.
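One way to picture the role of the transcript-to-gene table is deriving gene-level features from it. The sketch below assumes that a gene line spans from the smallest start to the largest end of its transcripts; that is an assumption about the output, not something confirmed by the project, and `gene_spans` plus the toy coordinates are made up for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

/// (transcript name, 0-based BED start, BED end)
type Transcript<'a> = (&'a str, u64, u64);

/// Group transcripts by gene and return each gene's 1-based, inclusive span.
/// Transcripts missing from the table are skipped in this sketch.
fn gene_spans(
    transcripts: &[Transcript<'_>],
    tx_to_gene: &HashMap<&str, &str>,
) -> BTreeMap<String, (u64, u64)> {
    let mut spans: BTreeMap<String, (u64, u64)> = BTreeMap::new();
    for &(name, start0, end) in transcripts {
        if let Some(gene) = tx_to_gene.get(name) {
            // Start from this transcript's interval, then widen as needed.
            let span = spans.entry(gene.to_string()).or_insert((start0 + 1, end));
            span.0 = span.0.min(start0 + 1);
            span.1 = span.1.max(end);
        }
    }
    spans
}

fn main() {
    // Toy data, not real annotation.
    let table = HashMap::from([("tx1", "geneA"), ("tx2", "geneA"), ("tx3", "geneB")]);
    let transcripts = [("tx1", 100, 200), ("tx2", 50, 150), ("tx3", 10, 20)];
    for (gene, (start, end)) in gene_spans(&transcripts, &table) {
        println!("{gene}\t{start}\t{end}");
    }
}
```

A `BTreeMap` is used here only so the genes come out in a stable, sorted order.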

References

  1. https://bioinformatics.stackexchange.com/questions/2242/how-to-convert-bed-to-gff3
  2. https://www.biostars.org/p/2/

Owner

  • Name: Alejandro Gonzales-Irribarren
  • Login: alejandrogzi
  • Kind: user

GitHub Events

Total
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 8
  • Push event: 1
  • Pull request event: 2
  • Fork event: 1
Last Year
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 8
  • Push event: 1
  • Pull request event: 2
  • Fork event: 1

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 33
  • Total Committers: 1
  • Avg Commits per committer: 33.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 33
  • Committers: 1
  • Avg Commits per committer: 33.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
alejandrogzi j****1@u****e 33
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 3
  • Average time to close issues: 23 days
  • Average time to close pull requests: about 18 hours
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.67
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 1
  • Average time to close issues: 7 days
  • Average time to close pull requests: 2 days
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 2.33
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • karlaarz (1)
  • SWei2333 (1)
  • sanyalab (1)
Pull Request Authors
  • alejandrogzi (2)
  • cmdcolin (2)
Top Labels
Issue Labels
question (1)
Pull Request Labels

Dependencies

Cargo.lock cargo
  • proc-macro2 1.0.67
  • quote 1.0.33
  • syn 2.0.37
  • thiserror 1.0.49
  • thiserror-impl 1.0.49
  • unicode-ident 1.0.12
Cargo.toml cargo