Finding and removing introns from RNA-Seq de novo assemblies with IntronSeeker

Finding and removing introns from RNA-Seq de novo assemblies with IntronSeeker - Published in JOSS (2024)

https://github.com/bios4biol/intronseeker

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 40% confidence

Last synced: 6 months ago · JSON representation

Repository

IntronSeeker identifies potentially retained introns in de novo RNA-seq assembly in order to quantify and remove them.

Basic Info

Host: GitHub
Owner: Bios4Biol
License: gpl-3.0
Language: HTML
Default Branch: master
Size: 590 MB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 5
Releases: 1

Created almost 2 years ago · Last pushed 10 months ago

Metadata Files

Readme Contributing License Code of conduct Authors

intronSeeker

RNA-Seq reads quite often contain non-spliced introns. These introns originating from non mature mRNAs can break the translated protein Open Reading Frame. In order to get the correct protein sequence introns have to be removed. Retained introns will be present in only some sequences nevertheless they can therefore be present in the corresponding assembled contigs.
Introns are retained by "accident" meaning due to non mature transcripts present in the cell. These transcripts have not undergone complete splicing and therefore still harbor introns. The upstream step is not data processing but a biological process called splicing.

Splice site signal of these introns can be used to find and remove them. Intron splice site boundaries (canonical or not), the Open Reading Frame size and related protein alignments can be used as different hints to measure the probabiliy of having a retained intron.

We developped a tool to identify potentially retained introns in de novo RNA-seq assembly in order to quantify and remove them.

This tool includes two types of RNA-seq data simulations to validate the detection process and measure the false positive detection rate.

The first simulation module uses random sequence simulation in order to check if splice aligners are able to find inserted introns when only contigs with introns and reads whithout intron are used as well as when contigs with and without introns and reads without introns are used.

The second simulation module uses an existing genome and corresponding annotation. In this case the simulator produces reads with an without intron as well as transcripts with and whithout introns. This modules enables to measure the fraction of retained introns which can be detected in real conditions and to set the appropriate detection thresholds.

Simulator and simulation data descriptions can be found in the SIMULATION.md file located in doc directory here

How it works ?

Searching for introns

Dashed lines connectors are used to represent optional inputs files. Solid lines connectors are used to represent inputs and outputs files.

Inputs data for intronSeeker are raw RNASeq paired reads (R1.fastq and R2.fastq). These reads are aligned on the contigs with STAR or HiSAT2, according to user preferences. Splices sites are searched and intronSeeker generate a list of candidates and a list of splits alignements. This candidat file is then tested to check if they correspond to valid intron retention events. A clean reference without introns is generated in Fasta format with TrimFasta step. From the reference and trimmed Fasta files, intronSeeker predicts the Open Reading Frames and find proteic hits to find evidences.

Simulated data are used for validating the method and measure the false positive detection rate. Intron detection never involve a simulation step.

The main pipeline step are: 1. Use Star or Hisat2 to map reads on assembled contigs.

Search splice events in bam file using CIGAR string.
Write a file of intron candidats.

How to install ?

To install intronSeeker, see the INSTALL.md file

How to use ?

Here, we'll present a non-exhaustive documentation on how to use the intronSeeker program. For an exhaustive documentation about the different functions (description, options...), please see the README file in data directory. We'll just present a fast and basic usage with examples. All the input files or output files presented here are available in data directory.

Before running any intronSeeker command, activate the conda environment with :

diff conda activate ISeeker_environment

Reads alignment on reference contigs : Hisat or Star.

From here, we will use files corresponding to reduced real dataset available in the data directory (this data comes from Ceanhorhabditis elegans).

intronSeeker can use two aligners : STAR and Hisat2. Hisat2 gives better alignment results but takes longer to run and works only with paired-end library.

The aligners we tested were the only two capable of handling alignments on contigs with and without the intron. We wrapped them to simplify their launch.

In introSeeker directory, in order to run alignment, use the commands :

diff ./intronSeeker starAlignment -r data/Reduced_real_dataset/Test_set_Cele_contig-assembly.fasta -1 data/Reduced_real_dataset/Test_set_Cele_reads-1.fastq.gz -2 data/Reduced_real_dataset/Test_set_Cele_reads-2.fastq.gz -o data/test_Reduced_real_dataset/Cele_library-contigs_starAlignment

or

diff ./intronSeeker hisat2Alignment -r data/Reduced_real_dataset/Test_set_Cele_contig-assembly.fasta -1 data/Reduced_real_dataset/Test_set_Cele_reads-1.fastq.gz -2 data/Reduced_real_dataset/Test_set_Cele_reads-2.fastq.gz -o data/test_Reduced_real_dataset/Cele_library-contigs_HISAT2Alignment N.B.: If you wish to re-run this step, it is necessary to change the name of your output directory.

Splicing event search

When the alignment is finished, you can search for splicing events with HiSAT2:

diff ./intronSeeker splitReadSearch -a data/test_Reduced_real_dataset/Cele_library-contigs_HISAT2Alignment/hisat2.sort.bam -r data/Reduced_real_dataset/Test_set_Cele_contig-assembly.fasta -o data/test_Reduced_real_dataset/Test_Cele_splicing_event_HISAT2

or with STAR:

diff ./intronSeeker splitReadSearch -a data/test_Reduced_real_dataset/Cele_library-contigs_starAlignment/star.sort.bam -r data/Reduced_real_dataset/Test_set_Cele_contig-assembly.fasta -o data/test_Reduced_real_dataset/Test_Cele_splicing_event_STAR

List features by FASTA trimming

When trimFastaFromTXT produces a new FASTA reference file where features, listed in candidats file, are trimmed from the reference FASTA.

diff ./intronSeeker trimFastaFromTXT -r data/Reduced_real_dataset/Test_set_Cele_contig-assembly.fasta -c data/test_Reduced_real_dataset/Test_Cele_splicing_event_HISAT2/srs_candidates.txt -o data/test_Reduced_real_dataset/Test_Cele_trimFASTA

Find evidence

If you have a reference protein set, you can perform a proteic alignment against the "reference" and "trimref" FASTA files to find evidences.

diff ./intronSeeker findEvidence -r data/Reduced_real_dataset/Test_set_Cele_contig-assembly.fasta -t data/Reduced_real_dataset/Test_Cele_trimFASTA/tf_trimmed.fa -d data/Reduced_real_dataset/Test_protein-diamond-database.fasta -c data/Reduced_real_dataset/Test_Cele_splicing_event_HISAT2/srs_candidates.txt -o data/Reduced_real_dataset/Test_Cele_FindEvidence

Generate a intronSeeker simulation report

Create a configuration file data/testReducedrealdataset/buildReportexample.cfg :

diff [Defaults] fasta:data/Reduced_real_dataset/Test_set_Cele_contig-assembly.fasta r1:data/Reduced_real_dataset/Test_set_Cele_reads-1.fastq.gz r2:data/Reduced_real_dataset/Test_set_Cele_reads-2.fastq.gz flagstat:data/test_Reduced_real_dataset/Cele_library-contigs_HISAT2Alignment/hisat2.sort.flagstat.txt candidat:data/test_Reduced_real_dataset/Test_Cele_splicing_event_HISAT2/srs_candidates.txt split:data/test_Reduced_real_dataset/Test_Cele_splicing_event_HISAT2/srs_split_alignments.txt prefix:Cele threads:6 output:test_HTML/ force: -F

diff ./intronSeeker buildReport -F --config_file data/test_Reduced_real_dataset/buildReport_example.cfg

Your HTML simulation report is available in test_HTML directory.

Wrappers description

Table: intronSeeker programs and wrappers.

Files formats

Candidats file format

Head of candidats text file.

| Column | Fields | Type | Description| | :--- | :----: | :----: | ---: | | 1 | ID | String | Candidat named by contigstart end | | 2 | reference | String | Contig name.| | 3 | start | Int | Start position of the candidat.| | 4 | end | Int | End position of the candidat.| | 5 | depth | Int | Number of reads by candidat (depth).| | 6 | splitborders | String | Junction sites. If split borders are GTAC (donor site) or CTAC (acceptor site) , then these junctions are canonic.| | 7 | DPbefore | Int | Mean depth for 10bp before the candidate.| | 8 | DPin | Int | Mean depth of candidate.| | 9 | DPafter | Int | Mean depth for 10bp after the candidate.| | 10 | filter | String | Filter candidates according to a minimum depth (DP), a maximum length (LEN), canonical junction (SS), or PASS introns.|

Table: Candidat file description.

Split file format

Head of split text file.

| Column | Fields | Type | Description| | :--- | :----: | :----: | ---: | | 1 | reference | String | Contig name.| | 2 | read | String | Read name. | |3 | startsplit| Int | Start position of the split read.| |4 | endsplit | Int | End position of the split read.| |5 |splitlength | Int | Length of split read.| |6 |splitborders| String | Junction sites. If split borders are GTAC (donor site) or CTAC (acceptor site) , then these junctions are canonic.| | 7 | strand | String | Defined as + (forward) or - (reverse).|

Table: Split file description.

Trimmed FASTA

Head of trimmed fasta file.

Contig without retained intron(s)

| Line | Name | Type | Description| | :--- | :----: | :----: | ---: | | 1 | Sequence description | String | Sequence name, start, stop and trimmed sequence name, start, stop.| | 2 | Fasta sequence | String | Trimmed fasta sequence|

Table: Trimmed fasta file description.

Grinder file

Grinder file is an input to simulate data with FRS (Full Random Simulation) or GBS (GTF based simulation). All parameters are described here: https://sourceforge.net/projects/biogrinder/files/ A grinder file is given for example: config/grinder_frs.cfg

Here are a few examples: diff -cf 100 # Fold coverage (100X). Choose either "tr" or "cf" parameter. tr is the number of reads to generate for each library. Example: -tr 2000000 -rd 100 normal 7.5 # Read length distribution -id 500 normal 250 # Insert distribution -mo FR # Mate orientation -md poly4 3e-3 3.3e-8 # Mutation distribution -mr 80 20 # Mutation ratio -am uniform # Abundance model : -am uniform for an homogen distribution, or -am powerlaw 0.5 for a powerlaw distribution -fq 1 # Generated reads in FASTQ format -ql 32 2 # Quality levels

IntronSeeker report

Each section of the HTML report corresponds to an intronSeeker step.

IntronSeeker menu

HTML report menu.

Inputs files list and description

First simulation report's tab aims to check input files names listed in buildReport.cfg config file or in parameters.

inputs files complete names.

A glossary provides definitions of the main terms used in the HTML report:

Main definitions.

Statistics and description of input files from data simulation step, and statistics from FASTA, modified FASTA, GTF and assemblathon files.

Inputs files statistics.

Graphical representation of statistics below

inputs files statistics.

Statistics from intronSeeker alignment step (STAR): Reads statistics from R1 (and R2) fastq file(s), alignment statistics from flagstat file.

STAR mapping statistics.

Warning: Mix-state parameter increase number of secondary alignments (with and without introns, with and without splicing).

Intron extraction results

Statistics from intronSeeker splitReadSeach step.

Statistics from split alignment and candidates
files.

The depth is the number of reads overlapping an intron. Difference between split mean length and detected intron mean length could be explain by the fact that several reads split on a same location so one candidat have several splits.

Evaluation of detected introns:

Comparison between filtered detected and detectables
introns.

Detected introns filtered from candidat file and detectable introns from GTF and modified FASTA files, filtered according to threshold (a minimum depth (DP) and a maximum length (LEN)), canonical junction (SS), and the number of PASS introns.

Top 10 of contigs having the highest number of detected introns tab list contig name, number of detected introns (filtered detected introns), and number of detectable introns (filtered features).

Top 10 of contigs with too complex detected and detectable introns on GBS data. Source: GBS simulation on 1 000 contigs (mixed state) of Drosophila m. with STAR mapping XXXXXXX

Evaluation of detected introns

Raw and filtered introns.

« Number of detected introns » / « raw » = Total number (PASS + DP + LEN + SS) of introns listed in candidat file.

« Number of detected introns » / «filtered» = Only « PASS » introns listed in candidat file.

« Number of features » / « raw » = Total number (PASS + DP + LEN + SS) of introns listed in GTF + modified fasta files.

« Number of features » / «filtered» = Only « PASS » listed in GTF + modified fasta files.

Evaluation/prediction of detected introns

Detection assessement.

Sensibility is the fraction of \"detected filtered\" (201) presents in «Truth » (200) found by intronSeeker (195). Specificity is the fraction of truth « detected filtered » among all predicted.

IntronSeeker standard filters

By running several simulations and then searching for introns in real data, we were able to define standard filters.

Introns from candidat file have been detected and filtered according to a minimum depth, a maximum length and split borders should be canonicals.

Too complex introns are also filtered. An intron is too complex whether because of its size or because its an overlapping intron.

| Filter name | Description| | :--- | :--- | | DP | Filtered because of depth (). Candidates will be flagged if not supported by more than [1] read.| | LEN | Filtered because of length (%). Remove transcript if intron length greater than 80% of the total exons length| | SS | Filtered because of non canonical junction.| | OI | Filtered because of overlapping introns.v | RDP | if (DPbefore + DPafter)/2 /5 < DP_in| | PASS | After filtering, candidates are flagged PASS.|

Table: IntronSeeker introns filters.

Contributing

To contribute to this project, please see the CONTRIBUTING.md file.

Owner

Login: Bios4Biol
Kind: user
Company: @INRA

Repositories: 1
Profile: https://github.com/Bios4Biol

JOSS Publication

Finding and removing introns from RNA-Seq de novo assemblies with IntronSeeker

Published

November 15, 2024

DOI

10.21105/joss.06272

Volume 9, Issue 103, Page 6272

Authors

Sarah Maman

Sigenae, GenPhySE, Université de Toulouse, INRAE, ENVT, F-31326, Castanet Tolosan, France.

Philippe Bardou

Sigenae, GenPhySE, Université de Toulouse, INRAE, ENVT, F-31326, Castanet Tolosan, France.

Emilien Lasguignes
Université Fédérale de Toulouse, INRAE, BioinfOmics, GenoToul Bioinformatics facility, Sigenae, 31326, Castanet-Tolosan, France

Faustine Oudin
Université Fédérale de Toulouse, INRAE, BioinfOmics, GenoToul Bioinformatics facility, Sigenae, 31326, Castanet-Tolosan, France

Floréal Cabanettes
Université Fédérale de Toulouse, INRAE, BioinfOmics, GenoToul Bioinformatics facility, Sigenae, 31326, Castanet-Tolosan, France

Christophe Klopp

Université Fédérale de Toulouse, INRAE, BioinfOmics, GenoToul Bioinformatics facility, Sigenae, 31326, Castanet-Tolosan, France

Editor

Kelly Rowland

GitHub Events

Total

Release event: 1
Push event: 4
Pull request event: 2
Fork event: 2
Create event: 1

Last Year

Release event: 1
Push event: 4
Pull request event: 2
Fork event: 2
Create event: 1

Committers

Last synced: 11 months ago

All Time

Total Commits: 861
Total Committers: 9
Avg Commits per committer: 95.667
Development Distribution Score (DDS): 0.461

Past Year

Commits: 29
Committers: 4
Avg Commits per committer: 7.25
Development Distribution Score (DDS): 0.172

Top Committers

Name	Email	Commits
Sarah Maman-Haddad	s**d@i**r	464
Sarah Maman	s**n@i**r	200
pbardou	p**u@i**r	89
Bios4Biol	3****l	53
Emilien Lasguignes	e**s@i**r	40
ad_miat	a**t@m**r	7
Christophe Klopp	c**p@i**r	5
Philippe Bordron	p**n@i**r	2
Juanjo Bazán	j**n@g**m	1

Committer Domains (Top 20 + Academic)

inra.fr: 4 inrae.fr: 2 montcalm.toulouse.inra.fr: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 7
Total pull requests: 4
Average time to close issues: 15 days
Average time to close pull requests: about 1 hour
Total issue authors: 3
Total pull request authors: 2
Average comments per issue: 2.86
Average comments per pull request: 0.0
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 4
Pull requests: 3
Average time to close issues: 7 days
Average time to close pull requests: about 1 hour
Issue authors: 2
Pull request authors: 2
Average comments per issue: 1.75
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Finding and removing introns from RNA-Seq de novo assemblies with IntronSeeker

Science Score: 93.0%

Scientific Fields

Repository

Basic Info

Statistics

Metadata Files

README.md

intronSeeker

How it works ?

Searching for introns

How to install ?

How to use ?

Reads alignment on reference contigs : Hisat or Star.

Splicing event search

List features by FASTA trimming

Find evidence

Generate a intronSeeker simulation report

Wrappers description

Files formats

Candidats file format

Split file format

Trimmed FASTA

Grinder file

IntronSeeker report

IntronSeeker menu

Inputs files list and description

Intron extraction results

IntronSeeker standard filters

Contributing

Owner

JOSS Publication

Finding and removing introns from RNA-Seq de novo assemblies with IntronSeeker

Authors

Editor

Tags

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies