inferring-exon-and-intron-metadata-from-.gff-file

This pipeline infers exon functional annotations using coordinates from a .gff file and derives corresponding intron coordinates along with relevant annotations. Users can optionally specify the length of intronic fragments to extract. The pipeline is suited for splicing code analyses.

https://github.com/sanjanabhatnagar/inferring-exon-and-intron-metadata-from-.gff-file

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

exon-annotation exon-intron-structure gff gff-toolkit intron-annotation intron-classification intron-coordinates introns splicing splicing-analyses

Repository


Basic Info
  • Host: GitHub
  • Owner: sanjanabhatnagar
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 144 KB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

Inferring Exon and Intron Annotations from Genome Feature Files (.gff)

Overview

This pipeline extracts exon and intron metadata from .gff files using only coordinate information. It classifies exons into the following splicing categories:

  • Constitutively spliced exons (always included in mRNA)
  • Skipped exons (e.g., cassette exons and mutually exclusive exons)
  • Alternative 5’ splice site (5’ss) exons (exons with two donor sites)
  • Alternative 3’ splice site (3’ss) exons (exons with two acceptor sites)
  • Retained introns (introns incorporated into mRNA)
  • Composite exons (exhibiting multiple alternative splicing types)

Additionally, it annotates the first and last exons per transcript for a given gene model.
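As a rough illustration of how coordinates alone can separate some of these categories, the sketch below labels the exons of a single plus-strand gene. The data layout (a dict of transcript ID to exon `(start, end)` tuples) and the function name `classify_exons` are assumptions made for this example, not the pipeline's actual implementation. It covers only four labels and ignores first/last exons, retained introns, and composite exons.

```python
from collections import defaultdict


def classify_exons(transcripts):
    """Label each distinct exon of one plus-strand gene (hypothetical helper).

    transcripts: dict of transcript ID -> list of (start, end) exon tuples.
    Returns a dict of (start, end) -> label.
    """
    n_transcripts = len(transcripts)
    usage = defaultdict(set)  # exon -> set of transcripts that contain it
    for tid, exons in transcripts.items():
        for exon in exons:
            usage[exon].add(tid)

    all_exons = list(usage)
    labels = {}
    for exon in all_exons:
        start, end = exon
        if len(usage[exon]) == n_transcripts:
            labels[exon] = "constitutive"
        elif any(o != exon and o[0] == start for o in all_exons):
            # Same acceptor (start) but a different donor (end) on the plus strand.
            labels[exon] = "alternative_5ss"
        elif any(o != exon and o[1] == end for o in all_exons):
            # Same donor (end) but a different acceptor (start) on the plus strand.
            labels[exon] = "alternative_3ss"
        else:
            labels[exon] = "skipped"
    return labels


# Toy gene model with three transcripts (coordinates are made up).
gene = {
    "t1": [(100, 200), (300, 400), (500, 600), (700, 800)],
    "t2": [(100, 200), (500, 650), (700, 800)],
    "t3": [(100, 200), (300, 400), (500, 600), (700, 800)],
}
labels = classify_exons(gene)
```

Here `(300, 400)` is missing from `t2` and shares no boundary with another exon, so it is labelled as skipped, while `(500, 600)` and `(500, 650)` share a start and differ in their end, so both are labelled as alternative 5’ss exons.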

Inferring Intron Coordinates

The .gff file does not include intron information. To address this:

  1. AGAT infers the coordinates of inter-exonic regions.
  2. gff_exon-intron_annotations.py filters actual introns and adds metadata using flanking exon annotations.
  3. crdnts_to_bed.R converts the extracted intron coordinates to BED format.
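The idea behind inferring inter-exonic regions can be sketched in a few lines. In the pipeline this step is done by AGAT; the function below (`infer_introns`, a hypothetical name) only illustrates the coordinate arithmetic for one transcript, assuming 1-based inclusive coordinates as used in .gff files.

```python
def infer_introns(exons):
    """Return the gaps between consecutive exons of one transcript.

    exons: list of (start, end) tuples, 1-based and inclusive.
    Returns a list of (start, end) intron tuples in the same convention.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only emit a region when at least one base separates the two exons.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


example_introns = infer_introns([(500, 600), (100, 200), (300, 400)])
```

For the three exons above, the inferred regions are `(201, 299)` and `(401, 499)`.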


Pipeline Steps

1. Download the .gff File

.gff files can be obtained from RefSeq or other genomic databases. They typically include gene structure elements (exons, UTRs, CDS, etc.) but lack intron coordinates.

2. Infer Intron Coordinates Using AGAT

Install AGAT and add it to the system PATH, then run:

```bash
agat_sp_add_introns.pl --gff input.gff -o updated.gff
```

This step generates an updated .gff file that includes inter-exonic regions.

3. Annotate Exons and Introns

Run the Python script to process the updated .gff file:

```bash
python gff_exon-intron_annotations.py updated.gff keywords.txt metadata_Introns_annotated.tsv metadata_Introns_Exons_annotated.tsv intron_coordinates.tsv [absolute/relative] [size]
```

Parameters:

  • updated.gff: Output from AGAT containing inferred intron regions.
  • keywords.txt: List of metadata fields to extract (e.g., ID, gene_id, transcript_id). This list is needed because the attributes column of a .gff file contains different fields for different organisms.
  • absolute/relative: Determines coordinate type. absolute outputs standard genomic coordinates, while relative scales them from 0 to gene length.
  • size: Defines intron fragment length (default = median intron size, rounded to the nearest 10 bp).
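To make the role of the keyword list concrete, here is a minimal sketch of pulling selected fields out of a GFF3 attributes column. It assumes the usual `key=value;key=value` layout; the function name `extract_fields` and the choice to return an empty string for missing keys are assumptions for illustration, not the script's documented behaviour.

```python
def extract_fields(attributes, keywords):
    """Pick the requested keys out of a GFF3 attributes string.

    attributes: text of column 9, e.g. "ID=exon-1;Parent=rna-1".
    keywords: list of field names to keep.
    Returns a dict with one entry per keyword ("" when the key is absent).
    """
    pairs = {}
    for item in attributes.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            pairs[key.strip()] = value.strip()
    return {key: pairs.get(key, "") for key in keywords}


fields = extract_fields(
    "ID=exon-1;Parent=rna-1;gene=abc",
    ["ID", "gene", "transcript_id"],
)
```

Because `transcript_id` is not present in this example record, its value comes back empty, which shows why the keyword list has to be adapted to each organism's annotation.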

Intron Splitting Strategy:

  • Introns larger than the median size are split into two parts:
    • First fragment (start of the intron)
    • Last fragment (end of the intron)
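The default size and the splitting rule can be sketched as follows. This is one plausible reading of the description above, not the script's confirmed logic: it assumes each fragment is exactly `size` bp long, that introns no longer than `size` are kept whole, and that ties in rounding go upward. The names `default_fragment_size` and `split_intron` are hypothetical.

```python
from statistics import median


def default_fragment_size(introns):
    """Median intron length rounded to the nearest 10 bp (ties round up)."""
    lengths = [end - start + 1 for start, end in introns]
    return int(median(lengths) / 10 + 0.5) * 10


def split_intron(start, end, size):
    """Split an intron into a first and a last fragment of `size` bp each.

    Introns of length <= size are returned unchanged. For introns shorter
    than 2 * size the two fragments overlap under this assumption.
    """
    length = end - start + 1
    if length <= size:
        return [("whole", start, end)]
    return [
        ("first", start, start + size - 1),
        ("last", end - size + 1, end),
    ]


introns = [(201, 299), (1001, 1500), (2001, 2120)]  # lengths 99, 500, 120
size = default_fragment_size(introns)
fragments = split_intron(1001, 1500, size)
```

With these toy introns the median length is 120 bp, so the 500 bp intron is split into `(1001, 1120)` and `(1381, 1500)`, and the other two are left whole.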

Output Files

1. metadata_Introns_annotated.tsv

  • Contains actual intron coordinates and metadata.
  • If relative is chosen, absolute and relative start/end coordinates are included.
  • IDs for intron fragments are generated.
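The relative option can be pictured as an offset from the gene boundary. The sketch below assumes that relative coordinates are counted from the gene start on the plus strand and from the gene end on the minus strand; the minus-strand convention and the function name `to_relative` are assumptions, since the README only states that coordinates run from 0 to gene length.

```python
def to_relative(start, end, gene_start, gene_end, strand="+"):
    """Convert absolute coordinates to offsets within the gene."""
    if strand == "+":
        return start - gene_start, end - gene_start
    # Minus strand: measure from the gene end so that offsets still grow 5' to 3'.
    return gene_end - end, gene_end - start


plus_rel = to_relative(1201, 1299, 1000, 2000, "+")
minus_rel = to_relative(1201, 1299, 1000, 2000, "-")
```

For an intron at 1201 to 1299 in a gene spanning 1000 to 2000, this gives `(201, 299)` on the plus strand and `(701, 799)` on the minus strand.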

2. metadata_Introns_Exons_annotated.tsv

  • Includes metadata for both exons and introns.

3. intron_coordinates.tsv

  • Contains filtered intron coordinates, which can be converted to BED format.

Converting to BED Format

Use crdnts_to_bed.R to convert intron_coordinates.tsv to BED format:

```r
Rscript crdnts_to_bed.R intron_coordinates.tsv
```

This generates two BED files:

  • Positive strand (positive_strand.bed)
  • Negative strand (negative_strand.bed)
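For readers who prefer Python, the strand split can be sketched as below. This is not a port of crdnts_to_bed.R: the input layout (tuples of chromosome, start, end, strand) and the shift from 1-based inclusive GFF coordinates to 0-based half-open BED coordinates are assumptions, and whether the R script applies that shift is not stated in the README.

```python
def to_bed_lines(rows):
    """Group intron rows by strand and format them as BED lines.

    rows: iterable of (chrom, start, end, strand) with 1-based inclusive
    coordinates. Returns {"+": [...], "-": [...]} of tab-separated lines.
    """
    by_strand = {"+": [], "-": []}
    for chrom, start, end, strand in rows:
        # BED starts are 0-based; BED ends are exclusive, so `end` is unchanged.
        by_strand[strand].append(f"{chrom}\t{start - 1}\t{end}")
    return by_strand


bed = to_bed_lines([
    ("chr1", 201, 299, "+"),
    ("chr1", 401, 499, "-"),
])
```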

Adding Unique IDs

Use awk to append IDs:

```bash
awk '{printf("%s+%d%d\n",$1,$2,$3); }' positive_strand.bed > positiveID.bed
paste positive_strand.bed positiveID.bed > finalpositive.bed

awk '{printf("%s-%d%d\n",$1,$2,$3); }' negative_strand.bed > negativeID.bed
paste negative_strand.bed negativeID.bed > finalnegative.bed
```

After editing to replace spaces with tabs, the output should resemble:

```
NC026501.1 3822 3894 NC026501.1+38223894
NC026501.1 4861 4961 NC026501.1+48614961
...
```
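The same ID construction can be written in Python, which may be convenient when the BED lines are already in memory. The sketch mirrors the awk format string shown above (chromosome, strand symbol, then start and end concatenated); the function name `append_ids` is hypothetical.

```python
def append_ids(bed_lines, strand_symbol):
    """Append an ID column built from chromosome, strand symbol, start and end."""
    out = []
    for line in bed_lines:
        chrom, start, end = line.split()[:3]
        # Equivalent to awk's printf("%s+%d%d"), followed by a tab-joined paste.
        out.append(f"{line}\t{chrom}{strand_symbol}{int(start)}{int(end)}")
    return out


with_ids = append_ids(["NC026501.1\t3822\t3894"], "+")
```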

Summary

This pipeline extracts exon and intron annotations from .gff files by inferring intron positions, filtering out conditional introns (only preserving introns that don't overlap with any exons in other transcripts), and generating BED files for downstream analysis. It supports metadata extraction, intron classification, and size-based intron fragmentation.
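The filtering rule mentioned above (keep only introns that do not overlap an exon of another transcript) can be sketched as follows. The data layout and the names `overlaps` and `keep_unconditional_introns` are assumptions for illustration; the actual script may apply the rule per gene and with additional checks.

```python
def overlaps(a, b):
    """True when two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def keep_unconditional_introns(introns_by_tx, exons_by_tx):
    """Drop introns that overlap any exon belonging to a different transcript."""
    kept = {}
    for tid, introns in introns_by_tx.items():
        other_exons = [
            exon
            for other, exons in exons_by_tx.items()
            if other != tid
            for exon in exons
        ]
        kept[tid] = [
            intron
            for intron in introns
            if not any(overlaps(intron, exon) for exon in other_exons)
        ]
    return kept


exons_by_tx = {
    "t1": [(100, 200), (500, 600)],
    "t2": [(100, 200), (300, 400), (500, 600)],
}
introns_by_tx = {
    "t1": [(201, 499)],
    "t2": [(201, 299), (401, 499)],
}
filtered = keep_unconditional_introns(introns_by_tx, exons_by_tx)
```

In this toy model the single intron of `t1` spans an exon of `t2`, so it is removed, while both introns of `t2` are preserved.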

If you run into issues, refer to the AGAT documentation or modify the scripts as needed.

Cite

If you use this pipeline, please cite:

Bhatnagar, S., & Calarco, J. (2025). Inferring Exon and Intron Metadata from .gff file (v1.0.1). Zenodo. https://doi.org/10.5281/zenodo.15757714

References

Dainat, J. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format (Version v0.7.0). Zenodo. https://www.doi.org/10.5281/zenodo.3552717

Owner

  • Name: Sanjana Bhatnagar
  • Login: sanjanabhatnagar
  • Kind: user
  • Location: Toronto, Ontario
  • Company: University of Toronto

PhD candidate (Cell and Systems Biology)

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this pipeline, please cite it using the metadata below."
title: "Inferring Exon and Intron Metadata from .gff file"
authors:
  - family-names: Bhatnagar
    given-names: Sanjana
date-released: 2025-06-27
version: 1.0.0
repository-code: https://github.com/sanjanabhatnagar/Inferring-Exon-and-Intron-Metadata-from-.gff-file
doi: 10.5281/zenodo.15757210
type: software
license: Apache-2.0
keywords:
  - intron annotation
  - exon metadata
  - gff parser
  - bioinformatics pipeline

GitHub Events

Total
  • Push event: 3
  • Public event: 1
Last Year
  • Push event: 3
  • Public event: 1