inferring-exon-and-intron-metadata-from-.gff-file
This pipeline infers exon functional annotations using coordinates from a .gff file and derives corresponding intron coordinates along with relevant annotations. Users can optionally specify the length of intronic fragments to extract. The pipeline is suited for splicing code analyses.
https://github.com/sanjanabhatnagar/inferring-exon-and-intron-metadata-from-.gff-file
Metadata Files
README.md
Inferring Exon and Intron Annotations from Genome Feature Files (.gff)
Overview
This pipeline extracts exon and intron metadata from .gff files using only coordinate information. It classifies exons into the following splicing categories:
- Constitutively spliced exons (always included in mRNA)
- Skipped exons (e.g., cassette exons and mutually exclusive exons)
- Alternative 5’ splice site (5’ss) exons (exons with two donor sites)
- Alternative 3’ splice site (3’ss) exons (exons with two acceptor sites)
- Retained introns (introns incorporated into mRNA)
- Composite exons (exhibiting multiple alternative splicing types)
Additionally, it annotates the first and last exons per transcript for a given gene model.
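As a rough illustration of how coordinate-only classification can work, the sketch below labels the exons of one gene by comparing exon boundaries across its transcripts. It is not the pipeline's actual logic: the function name `classify_exons`, the input layout (transcript ID mapped to a list of `(start, end)` tuples), and the decision rules are all assumptions. It also assumes a plus-strand gene (on the minus strand the 5'ss and 3'ss labels would swap) and does not attempt to detect retained introns.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive as in GFF


def overlaps(a: Exon, b: Exon) -> bool:
    """True when two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_exons(transcripts: Dict[str, List[Exon]]) -> Dict[Exon, str]:
    """Label each distinct exon of a plus-strand gene (hypothetical helper)."""
    models = [sorted(exons) for exons in transcripts.values()]
    all_exons = {exon for exons in models for exon in exons}
    labels: Dict[Exon, str] = {}
    for exon in sorted(all_exons):
        start, end = exon
        others = all_exons - {exon}
        # Same acceptor side (start) but a different donor (end).
        alt_donor = any(s == start and e != end for s, e in others)
        # Same donor side (end) but a different acceptor (start).
        alt_acceptor = any(e == end and s != start for s, e in others)
        # Some transcript spans this exon yet has no exon overlapping it.
        skipped = any(
            exons[0][0] < start and end < exons[-1][1]
            and not any(overlaps(exon, other) for other in exons)
            for exons in models
        )
        events = [name for name, hit in (
            ("alternative_5ss", alt_donor),
            ("alternative_3ss", alt_acceptor),
            ("skipped", skipped),
        ) if hit]
        if len(events) > 1:
            labels[exon] = "composite"
        elif events:
            labels[exon] = events[0]
        elif all(exon in exons for exons in models):
            labels[exon] = "constitutive"
        else:
            labels[exon] = "unclassified"
    return labels


example = {
    "t1": [(100, 200), (300, 400), (500, 600)],
    "t2": [(100, 200), (500, 600)],
}
print(classify_exons(example))
```

In this toy gene the middle exon is absent from `t2` while `t2` still spans it, so it is labelled as skipped, and the two flanking exons are labelled constitutive.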
Inferring Intron Coordinates
The .gff file does not include intron information. To address this:
1. AGAT software infers inter-exonic region coordinates.
2. gff_exon-intron_annotations.py filters actual introns and adds metadata using flanking exon annotations.
3. The crdnts_to_bed.R script converts extracted intron coordinates to BED format.
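The idea behind the first two steps can be sketched in a few lines of Python. This is only an illustration, not the code used by AGAT or by the annotation script: both function names are hypothetical, coordinates are assumed to be 1-based and inclusive as in GFF, and the filter follows the rule stated in the Summary (keep only introns that do not overlap exons of other transcripts).

```python
def inter_exonic_regions(exons):
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > 1:  # skip abutting or overlapping exons
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


def keep_non_overlapping(introns, other_exons):
    """Keep introns that overlap no exon from other transcripts."""
    return [
        (start, end)
        for start, end in introns
        if not any(ex_start <= end and start <= ex_end
                   for ex_start, ex_end in other_exons)
    ]


gaps = inter_exonic_regions([(100, 200), (300, 400), (500, 600)])
print(gaps)                                      # [(201, 299), (401, 499)]
print(keep_non_overlapping(gaps, [(250, 280)]))  # [(401, 499)]
```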
Pipeline Steps
1. Download the .gff File
.gff files can be obtained from RefSeq or other genomic databases. They typically include gene structure elements (exons, UTRs, CDS, etc.) but lack intron coordinates.
2. Infer Intron Coordinates Using AGAT
Install AGAT and make sure it is available on the system PATH, then run:
```bash
agat_sp_add_introns.pl --gff input.gff -o updated.gff
```
This step generates an updated .gff file with inter-exonic regions.
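A quick sanity check before moving on is to count the feature types in the updated file and confirm that new inter-exonic features are present. The helper below is a minimal sketch with a hypothetical name. It assumes a tab-delimited GFF with the feature type in the third column and that AGAT labels the added features `intron`, which is worth verifying against your own output.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count values of the third GFF column, ignoring comments and blanks."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:  # a well-formed GFF record has nine columns
            counts[fields[2]] += 1
    return counts


# In-memory example; for a real file, pass an open file handle instead.
demo = [
    "##gff-version 3\n",
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\tID=e1\n",
    "chr1\tsrc\tintron\t201\t299\t.\t+\t.\tID=i1\n",
    "chr1\tsrc\texon\t300\t400\t.\t+\t.\tID=e2\n",
]
print(count_feature_types(demo))
```

For the real file, something like `with open("updated.gff") as handle: print(count_feature_types(handle))` should do.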
3. Annotate Exons and Introns
Run the Python script to process the updated .gff file:
```bash
python gff_exon-intron_annotations.py updated.gff keywords.txt metadata_Introns_annotated.tsv metadata_Introns_Exons_annotated.tsv intron_coordinates.tsv [absolute/relative] [size]
```
Parameters:
- updated.gff: Output from AGAT containing inferred intron regions.
- keywords.txt: List of metadata fields to extract (e.g., ID,gene_id,transcript_id). This is needed because the information column of a .gff file contains different data for different organisms.
- absolute/relative: Determines coordinate type. absolute outputs standard genomic coordinates, while relative scales them from 0 to gene length.
- size: Defines intron fragment length (default = median intron size, rounded to the nearest 10 bp).
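To make the keywords.txt and absolute/relative parameters more concrete, here is a small sketch of both ideas. Neither function is taken from the script and both names are hypothetical. The first assumes GFF3-style `key=value` pairs separated by semicolons in the ninth column (GTF uses a different syntax). The second shows one plausible reading of relative coordinates, offsets from the gene start, and ignores strand.

```python
def extract_attributes(attribute_column, keywords):
    """Pull the requested keys out of a GFF3 attributes string."""
    pairs = {}
    for item in attribute_column.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            pairs[key.strip()] = value.strip()
    # Keys missing from this record are returned as empty strings.
    return {key: pairs.get(key, "") for key in keywords}


def to_relative(start, end, gene_start):
    """Shift coordinates so that the gene start maps to 0 (assumed convention)."""
    return start - gene_start, end - gene_start


print(extract_attributes("ID=exon-1;Parent=rna-1;gene=abc", ["ID", "gene", "transcript_id"]))
print(to_relative(1500, 1600, gene_start=1000))
```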
Intron Splitting Strategy:
- Introns larger than the median size are split into two parts:
  - First fragment (start of the intron)
  - Last fragment (end of the intron)
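The default size and the splitting rule can be sketched as follows. This is an illustration under assumptions, not the script's implementation: the function names are hypothetical, coordinates are treated as 1-based and inclusive, and introns no longer than the fragment size are returned whole.

```python
from statistics import median


def default_fragment_size(intron_lengths):
    """Median intron length rounded to the nearest 10 bp."""
    return int(round(median(intron_lengths) / 10.0)) * 10


def split_intron(start, end, size):
    """Return the first and last `size` bases of an intron longer than `size`."""
    length = end - start + 1
    if length <= size:
        return [(start, end)]
    # The two fragments overlap when the intron is shorter than 2 * size.
    return [(start, start + size - 1), (end - size + 1, end)]


size = default_fragment_size([80, 123, 400])
print(size)                            # 120
print(split_intron(1000, 1999, size))  # [(1000, 1119), (1880, 1999)]
```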
Output Files
1. metadata_Introns_annotated.tsv
- Contains actual intron coordinates and metadata.
- If relative is chosen, absolute and relative start/end coordinates are included.
- IDs for intron fragments are generated.
2. metadata_Introns_Exons_annotated.tsv
- Includes metadata for both exons and introns.
3. intron_coordinates.tsv
- Contains filtered intron coordinates, which can be converted to BED format.
Converting to BED Format
Use crdnts_to_bed.R to convert intron_coordinates.tsv to BED format:
```r
Rscript crdnts_to_bed.R intron_coordinates.tsv
```
This generates two BED files:
- Positive strand (positive_strand.bed)
- Negative strand (negative_strand.bed)
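For readers who prefer to stay in Python, the strand split can be approximated as below. This is not a port of crdnts_to_bed.R. The row layout `(seqid, start, end, strand)` is an assumption about intron_coordinates.tsv, and the function name is hypothetical. The sketch also converts 1-based inclusive GFF starts to BED's 0-based half-open convention. Whether the R script applies the same shift is not stated here, so check its output before mixing the two.

```python
def split_by_strand(rows):
    """Group (seqid, start, end, strand) rows into BED-style tuples per strand."""
    beds = {"+": [], "-": []}
    for seqid, start, end, strand in rows:
        if strand in beds:  # rows with an unknown strand are dropped
            # GFF is 1-based inclusive; BED is 0-based, end-exclusive.
            beds[strand].append((seqid, int(start) - 1, int(end)))
    return beds


rows = [("chr1", 101, 200, "+"), ("chr1", 301, 400, "-"), ("chr1", 1, 5, ".")]
print(split_by_strand(rows))
```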
Adding Unique IDs
Use awk to append IDs:
```bash
awk '{printf("%s+%d%d\n",$1,$2,$3); }' positive_strand.bed > positiveID.bed
paste positive_strand.bed positiveID.bed > finalpositive.bed
awk '{printf("%s-%d%d\n",$1,$2,$3); }' negative_strand.bed > negativeID.bed
paste negative_strand.bed negativeID.bed > finalnegative.bed
```
After editing to replace spaces with tabs, the output should resemble:
```
NC026501.1 3822 3894 NC026501.1+38223894
NC026501.1 4861 4961 NC026501.1+48614961
...
```
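The same ID scheme can be reproduced in Python, which also removes the need for the manual space-to-tab edit. The helper name is hypothetical. It mirrors the awk format string (sequence name, strand symbol, then start and end concatenated) and writes tab-separated fields.

```python
def add_bed_id(line, strand_symbol):
    """Append an ID column built like the awk printf: seqid, symbol, start, end."""
    fields = line.split()
    seqid, start, end = fields[0], int(fields[1]), int(fields[2])
    bed_id = f"{seqid}{strand_symbol}{start}{end}"
    return "\t".join([seqid, str(start), str(end), bed_id])


print(add_bed_id("NC026501.1 3822 3894", "+"))
```

Because start and end are joined without a separator, two different intervals could in principle produce the same ID. Adding a delimiter between them would be a small change if that matters for your data.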
Summary
This pipeline extracts exon and intron annotations from .gff files by inferring intron positions, filtering out conditional introns (only preserving introns that don't overlap with any exons in other transcripts), and generating BED files for downstream analysis. It supports metadata extraction, intron classification, and size-based intron fragmentation.
For any issues, refer to AGAT documentation or modify the scripts accordingly.
Cite
If you use this pipeline, please cite:
Bhatnagar, S., & Calarco, J. (2025). Inferring Exon and Intron Metadata from .gff file (v1.0.1). Zenodo. https://doi.org/10.5281/zenodo.15757714
References
Dainat J. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format (Version v0.7.0). Zenodo. https://www.doi.org/10.5281/zenodo.3552717
Owner
- Name: Sanjana Bhatnagar
- Login: sanjanabhatnagar
- Kind: user
- Location: Toronto, Ontario
- Company: University of Toronto
- Repositories: 1
- Profile: https://github.com/sanjanabhatnagar
PhD candidate (Cell and Systems Biology)
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this pipeline, please cite it using the metadata below."
title: "Inferring Exon and Intron Metadata from .gff file"
authors:
  - family-names: Bhatnagar
    given-names: Sanjana
date-released: 2025-06-27
version: 1.0.0
repository-code: https://github.com/sanjanabhatnagar/Inferring-Exon-and-Intron-Metadata-from-.gff-file
doi: 10.5281/zenodo.15757210
type: software
license: Apache-2.0
keywords:
- intron annotation
- exon metadata
- gff parser
- bioinformatics pipeline