isocomp
Isocomp provides tools to compare any number of transcriptome assemblies (GTF + fasta) from long read RNAseq
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.3%) to scientific vocabulary
Keywords
Repository
Isocomp provides tools to compare any number of transcriptome assemblies (GTF + fasta) from long read RNAseq
Basic Info
Statistics
- Stars: 2
- Watchers: 6
- Forks: 8
- Open Issues: 4
- Releases: 1
Topics
Metadata Files
README.md
IsoComp: Comparing isoform composition between cohorts using high-quality long-read RNA-seq

Contributors:
- Yutong Qiu (Carnegie Mellon)
- Chia Sin Liew (University of Nebraska-Lincoln)
- Chase Mateusiak (Washington University)
- Rupesh Kesharwani (Baylor College of Medicine)
- Bida Gu (University of Southern California)
- Muhammad Sohail Raza (Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation)
- Evan Biederstedt (HMS)
- Umran Yaman (UK Dementia Research Institute, University College London)
- Abdullah Al Nahid (Shahjalal University of Science and Technology)
- Trinh Tat (Houston Methodist Research Institute)
- Sejal Modha (Theolytics Limited)
- Jędrzej Kubica (University of Warsaw)
Cite
Introduction
Transcriptomic profiling has gained traction over the past few decades, but its progress has been hindered by short-read sequencing: limited read length and the need for assembly complicate tasks such as inferring alternative splicing, allelic imbalance, and isoform variation.
The potential of long-read sequencing lies in its ability to overcome the inherent limitations of short reads. Tools like [Isoseq3](https://www.pacb.com/products-and-services/applications/rna-sequencing/) produce high-quality, polished, assembled full-length isoforms, allowing us to identify alternatively spliced isoforms and detect gene fusions. Further, with the introduction of HiFi sequencing, error rates in third-generation long reads have decreased significantly.
Aim
The aim of this project is to algorithmically characterize the "unique" (differing) isoforms between any number of samples using high-quality assembled isoforms.
Workflow

Running the pipeline
Installation
pip install isocomp==0.3.0
For usage guidelines, run:
isocomp --help
Step 1. Create windows
isocomp create_windows -i sample1.gtf sample2.gtf sample3.gtf -f transcript -o clustered_file.gtf
Step 2. Find unique isoforms across multiple samples
isocomp find_unique_isoforms -a clustered_file.gtf -f fasta_map.csv
File fasta_map.csv:
source,fasta
NA24385.filtered,BCM-data-HG002-All2Samples-hg38-Results/NA24385_HG002/MMSQANTI3Filter/NA24385.filtered.fasta
NA26105.filtered,BCM-data-HG002-All2Samples-hg38-Results/NA26105_GM26105/MMSQANTI3Filter/NA26105.filtered.fasta
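The mapping file is plain CSV with `source` and `fasta` columns. A minimal sketch of how such a file could be parsed (`load_fasta_map` is a hypothetical helper for illustration, not part of the isocomp API):

```python
import csv

def load_fasta_map(path):
    """Read a two-column CSV mapping sample source names to fasta paths."""
    with open(path, newline="") as fh:
        return {row["source"]: row["fasta"] for row in csv.DictReader(fh)}
```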
Example output
For each isoform that is unique to at least one sample, we provide information about the read and the similarity between that isoform and the most similar isoform within the same window.
The last column describes the normalized edit distance and the CIGAR string.
| win_chr | win_start | win_end | total_isoform | isoform_name | sample_from | sample_compared_to | mapped_start | isoform_sequence | selected_alignments |
|---|---|---|---|---|---|---|---|---|---|
| NC_060925.1 | 255178 | 288416 | 4 | PB.6.2 | HG004 | HG002 | 255173 | GGATTATCCGGAGCCAAGGTCCGCTCGGGTGAGTGCCCTCCGCTTTTT | 0.02_HG002_PB.6.2_3=6I1=3I1286=11I |
| NC_060925.1 | 255178 | 288416 | 4 | PB.6.2 | HG004 | HG005 | 255173 | GGATTATCCGGAGCCAAGGTCCGCTCGGGTGAGTGCCCTCCGCTTTTTG | 0.02_HG002_PB.6.2_3=6I1=3I1286=11 |
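The `selected_alignments` field appears to pack the normalized edit distance, the compared sample, the matched isoform name, and the CIGAR string into one underscore-delimited token. A sketch of how such a field could be unpacked; the layout is inferred from the example rows above, and `parse_selected_alignment` is an illustrative helper, not part of isocomp:

```python
def parse_selected_alignment(field):
    """Split a selected_alignments entry of the assumed form
    <norm_edit_distance>_<sample>_<isoform>_<cigar> into its parts.
    Isoform names are assumed not to contain underscores."""
    dist, sample, rest = field.split("_", 2)
    isoform, cigar = rest.rsplit("_", 1)
    return {"norm_edit_distance": float(dist),
            "sample": sample,
            "isoform": isoform,
            "cigar": cigar}
```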
Detailed project overview
https://github.com/collaborativebioinformatics/isocomp/blob/main/FinalPresentationBCMHackathon_12Oct2022.pdf
Methods
Overview of Methods
The core challenge, referred to as the "Isoform set comparison problem," involves identifying distinct isoforms between two sets of samples.
A direct approach to solving this problem is sequence matching between the full sets of isoforms. However, this becomes time-consuming at the scale of human isoform sets, which can run well beyond 10,000 sequences per sample.
We recognize that an all-against-all alignment across complete isoform sets isn't necessary. Instead, the focus is on comparing isoforms aligned to the same genomic regions. Genomic windows containing at least one isoform from any sample are extracted. The isoform sets are then subdivided into smaller subsets based on their origin in these extracted regions.
For each pair of samples under comparison, intersections are made between subsets of isoforms within each genomic window. This process identifies isoforms shared by both samples and isoforms unique to each sample.
For each unique isoform S from sample A, a deeper analysis is conducted on the differences between S and other isoforms from sample B within the same genomic window.
Aligning isoforms to the reference genome
For each individual sample, we initially prefix the sample name to the FASTA sequences in the finalized corrected FASTA output from SQANTI. This step ensures the uniqueness of all sequence names. Subsequently, we employ the minimap2 aligner (v2.24-r1122) to align the renamed FASTA sequences against the human Telomere-to-Telomere genome assembly of the CHM13 cell line (T2T-CHM13v2.0; RefSeq - GCF_009914755.1). The resultant alignment is presented in a SAM file, which is then converted into BAM format and sorted using samtools (v1.15.1; Danecek et al, 2021).
Segmentation of Isoforms into Subsets
Regions from the CHM13v2.0 genome that intersect with at least one isoform from any given sample are extracted. We initially determine the average coverage of isoforms per base using samtools mpileup (version?, Danecek et al, 2021). Subsequently, we identify and extract the 20,042 annotated protein-coding gene regions from the reference genome. To create windows, we merge these regions where overlaps occur. Further refinement is applied by filtering windows to those displaying per-base coverage greater than 0.05, resulting in a final set of 11,936 windows.
Apart from the annotated gene regions, each sample encompasses over 100,000 isoforms (Table 1) that align with intron regions. These isoforms, often considered novel, hold potential relevance to the observed phenotypes. To account for these, we divide the genome into 100-base-pair windows and retain those exhibiting per-base coverage exceeding 0.05.
Following this, the gene-related windows and the 100-base-pair windows are merged to form a comprehensive set of windows aligning with any isoform.
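The merging step above is a standard genomic interval merge. In practice a library such as pybedtools or PyRanges (both dependencies of this project) provides this operation; the pure-Python sketch below is illustrative only:

```python
def merge_windows(windows):
    """Merge overlapping (chrom, start, end) intervals into a minimal
    set of non-overlapping windows. `windows` is any iterable of
    genomic intervals; coordinates are assumed half-open."""
    merged = []
    for chrom, start, end in sorted(windows):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # overlaps the previous window on the same chromosome: extend it
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```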
Intersecting subsets of isoforms
For every isoform S within the subset of sample A, we conduct precise string matching against all isoforms within the subset of sample B. If no isoform in sample B, within the same genomic window, precisely matches S, we classify S as unique.
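A minimal sketch of this per-window exact-matching step, assuming each sample is represented as a mapping from window to `{isoform_name: sequence}` (a data layout chosen for illustration, not isocomp's internal representation):

```python
def unique_isoforms(sample_a, sample_b):
    """Return isoforms from sample_a whose sequence has no exact match
    in sample_b within the same genomic window."""
    unique = {}
    for window, isoforms in sample_a.items():
        # sequences from sample B that fall in the same window
        b_seqs = set(sample_b.get(window, {}).values())
        novel = {name: seq for name, seq in isoforms.items()
                 if seq not in b_seqs}
        if novel:
            unique[window] = novel
    return unique
```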
Comparing unique isoforms with other isoforms
For each unique isoform U, we employ the Needleman-Wunsch alignment method to compare U with the other isoforms within the same genomic window. The comparison is quantified as the percentage of matched bases in U.
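For illustration, a minimal pure-Python Needleman-Wunsch that reports the fraction of bases in U aligned to identical bases in the other isoform. The project cites edlib for fast edit-distance alignment; the scoring parameters here are arbitrary assumptions, not isocomp's:

```python
def nw_matched_fraction(u, v, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment of u vs v; returns the
    fraction of bases in u that align to an identical base in v."""
    n, m = len(u), len(v)
    # fill the DP score matrix
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if u[i-1] == v[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # traceback, counting identical aligned bases
    i, j, matched = n, m, 0
    while i > 0 and j > 0:
        if score[i][j] == score[i-1][j-1] + (match if u[i-1] == v[j-1] else mismatch):
            matched += u[i-1] == v[j-1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i-1][j] + gap:
            i -= 1
        else:
            j -= 1
    return matched / len(u) if u else 0.0
```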
Annotating the differences between unique isoforms and the other sequences
Differences among isoforms are categorized into [TODO] SNPs (<5bp), large-scale variants (>5bp), gene fusions, distinct exon utilization, and entirely novel sequences. These categories build upon those used by SQANTI to annotate disparities between sample isoforms and the reference transcriptome. Notably, we extend SQANTI's categories by incorporating SNPs and large-scale variants.
Iso-Seq analysis
Isoseq3 (v3.2.2)-generated HQ (full-length, high-quality) transcripts [Table 1] were mapped to GRCh38 (v33 p13) using the Minimap2 long-read aligner [1]. Basic statistics of the alignment for each sample [NA24385/HG002, NA24143/HG004, and NA24631/HG005] are provided in Table 2. The cDNAcupcake workflow [https://github.com/Magdoll/cDNACupcake] was then executed to collapse redundant isoforms from the BAM file. Low-count isoforms (<10) were filtered out, as were 5' degraded isoforms that might lack biological significance. Subsequently, SQANTI3 [2] was employed to generate the final corrected fasta [Table 3a] transcripts and GTF [Table 3b] files, along with isoform classification reports. External databases, including the reference data set of transcription start sites (refTSS), a list of polyA motifs, tappAS annotation, and Gencode hg38 annotation, were utilized. Finally, IsoAnnotLite (v2.7.3) analysis was conducted to annotate the GTF file from SQANTI3.
DEPENDENCIES
python >=3.9
If you're working on ada, you'll need to update the old, crusty version of
python to something more modern and exciting.
The easy way (untested, but should work):
Install miniconda and create a conda env with python 3.9
The manual method (source) (tested, works):
```
# ssh to ada with your username
ssh ...

mkdir /home/${USER}/.local

# use your favorite text editor; no need to be vim
vim /home/${USER}/.bashrc
# add the following to the end (or wherever)
export PATH=/home/$USER/.local/bin:$PATH

# log out of the current session and log back in
exit
ssh ...

# download a more current version of python
wget https://www.python.org/ftp/python/3.9.15/Python-3.9.15.tgz

# unpack, then remove the tarball
tar xfp Python-3.9.15.tgz
rm Python-3.9.15.tgz

# cd into the Python package dir, configure and make
cd Python-3.9.15/
./configure --prefix=/home/${USER}/.local --exec_prefix=/home/${USER}/.local --enable-optimizations
make          # this takes some time
make altinstall

# the following should point at a python in your /home/$USER/.local/bin dir
which python3.9

# optional, but convenient
ln -s /home/$USER/.local/bin/python3.9 /home/$USER/.local/bin/python

# download and run the pip installer
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3.9 get-pip.py

# confirm that pip is where you think it is
which pip     # location should be in your .local

# at this point, you can install poetry
pip install poetry
# and continue with the development install below
```
Github Codespace for Development
To use codespaces for development purposes, do the following:
- fork the repo
- switch to the 'develop' branch
- NOTE: if you plan to code/add a feature, create a branch from the 'develop' branch. Switch to it, and then continue on with the steps below.
- click the green 'code' button. But, rather than copying the https or ssh link, click the tab that says "Codespace"
- click the button that says "create codespace on develop". Go make some tea -- it takes ~5 minutes or so to set up the environment. But, once it is set up, you will have a fully functioning vscode environment with all the dependencies installed. Start running the tests, set some breakpoints, take a look around!
Development
Install poetry and consider setting its configuration so that the virtual environment for a given project is installed in that project's directory.
Next, I like working on a fork rather than the actual repository of record. I set my
git remotes so that origin points to
my fork, and upstream points to the 'upstream' repository.
```bash
➜ isocomp git:(develop) ✗ git remote -v
origin    https://github.com/cmatKhan/isocomp.git (fetch)
origin    https://github.com/cmatKhan/isocomp.git (push)
upstream  https://github.com/collaborativebioinformatics/isocomp.git (fetch)
upstream  https://github.com/collaborativebioinformatics/isocomp.git (push)
```
On your machine, cd into your local repository, git checkout the development
branch, and make sure it is up-to-date with the upstream (ie the original) repository.
NOTE: if you branch, in general make sure you branch off the develop branch, not main!
Then (assuming poetry is installed already), do:
```bash
$ poetry install
```
This will install the virtual environment with the dependencies (and the dependencies' dependencies) listed in the pyproject.toml.
Adding dependencies
To add a development dependency (eg, mkdocs, which is not something a user needs), use `poetry add -D <dependency>`. This is equivalent to pip-installing into your virtual environment, with the added benefit that the dependency is tracked in the pyproject.toml.
To add a deployment dependency, just omit the -D flag.
Writing code
Do this first!
```bash
$ pip install -e .
```
This is an 'editable install'
and means that any change you make in your code is immediately available in your environment.
NOTE: If you happen to see a Logging Error when you run the pip install -e .
command, you can ignore it.
If you use vscode, this is a useful plugin which will automatically generate docstrings for you. Default docstring format is google, which is what the scripts we currently have use. This is an example of what a google formatted docstring looks like:
```python
def get_all_windows(gene_df: pd.DataFrame, bp_df: pd.DataFrame) -> pd.DataFrame:
    """From gene boundaries and 100 bp nonzero coverage windows, produce a merged window df

    Args:
        gene_df (pd.DataFrame): one window per gene, > 0.05 avg coverage
        bp_df (pd.DataFrame): one window per 100 bp, > 0.05 avg coverage

    Returns:
        pd.DataFrame: merged windows df
    """
    ...
```
In the function definition, the type hints of the arguments (eg `gene_df: pd.DataFrame`) are not required, but if you include them, autoDocs will also automatically generate the data types in the docstring skeleton, which is nice. The `-> <datatype>` at the end of the function definition is the return data type.
Tests
Unit tests can be written into the src/tests directory. There is an example in src/tests/test_isocomp.py. There are a couple other examples of tests -- ie for logging and error handling -- here, too.
Build
It's good to intermittently build the package as you go. To do so, use poetry build
which will create a .whl and .tar.gz (dist is already included
in the gitignore). You can 'distribute' these files to others -- they
can be installed with pip or conda -- or use them to install the software outside of
your current virtual environment.
Documentation
If you would like to write documentation (ie not docstrings, but long form letters
to your adoring users), then this can be done in markdown
or jupyter notebooks (already added as a dev dependency) in the
docs directory. Add the markdown/notebook document to the nav section in
the mkdocs.yml and it will be added to the menu of the documentation
site. Use mkdocs serve locally to see what the documentation looks like.
mkdocs build will build the site in a directory called site, which is in the .gitignore already.
Like `poetry build`, it is a good idea to run `mkdocs build` intermittently as you write documentation. Eventually, we'll use `mkdocs gh-deploy` to deploy the site to github pages. Maybe if we get fancy, we'll set up the github actions to build the package on mac, windows and linux OSes on every push to develop, and rebuild the docs and push the package to pypi on every push to main.
Computational Resources / Operation
On a 16 core cloud instance with three GTF files (replicates) from
Genome in a Bottle subject HG002,
find_unique_isoforms took ~15 minutes and used ~3 to 5 GB of RAM.
Owner
- Name: collaborativebioinformatics
- Login: collaborativebioinformatics
- Kind: organization
- Repositories: 42
- Profile: https://github.com/collaborativebioinformatics
Citation (CITATIONS.md)
# If you use this repo, please cite:

> Qiu, Y., Liew, C. S., Mateusiak, C., Kesharwani, R., Gu, B., Raza, M. S., Biederstedt, E., Yaman, U., Al Nahid, A., Tat, T., Modha, S., & Kubica, J. (2023). Isocomp. Carnegie Mellon, University of Nebraska-Lincoln, Washington University, Baylor College of Medicine, University of Southern California, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, HMS, UK Dementia Research Institute, University College London, Shahjalal University of Science and Technology, Houston Methodist Research Institute, Theolytics Limited, University of Warsaw. https://github.com/collaborativebioinformatics/isocomp

## Significant dependencies

### BioPython

> Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11) 1422-3. https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878

### edlib

> Martin Šošić, Mile Šikić; Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 2017 btw753. doi: 10.1093/bioinformatics/btw753

### PyRanges

> Endre Bakken Stovner, Pål Sætrom; PyRanges: efficient comparison of genomic intervals in Python. Bioinformatics, Volume 36, Issue 3, February 2020, Pages 918–919. https://doi.org/10.1093/bioinformatics/btz615
GitHub Events
Total
- Watch event: 1
- Fork event: 1
Last Year
- Watch event: 1
- Fork event: 1
Dependencies
- appnope 0.1.3 develop
- asttokens 2.0.8 develop
- atomicwrites 1.4.1 develop
- attrs 22.1.0 develop
- backcall 0.2.0 develop
- beautifulsoup4 4.11.1 develop
- bleach 5.0.1 develop
- certifi 2022.9.24 develop
- cffi 1.15.1 develop
- charset-normalizer 2.1.1 develop
- click 8.1.3 develop
- colorama 0.4.6 develop
- debugpy 1.6.3 develop
- decorator 5.1.1 develop
- defusedxml 0.7.1 develop
- entrypoints 0.4 develop
- executing 1.1.1 develop
- fastjsonschema 2.16.2 develop
- ghp-import 2.1.0 develop
- idna 3.4 develop
- importlib-metadata 5.0.0 develop
- ipykernel 6.16.2 develop
- ipython 8.5.0 develop
- jedi 0.18.1 develop
- jinja2 3.1.2 develop
- jsonschema 4.16.0 develop
- jupyter-client 7.4.4 develop
- jupyter-core 4.11.2 develop
- jupyterlab-pygments 0.2.2 develop
- jupytext 1.14.1 develop
- lxml 4.9.1 develop
- markdown 3.3.7 develop
- markdown-it-py 2.1.0 develop
- markupsafe 2.1.1 develop
- matplotlib-inline 0.1.6 develop
- mdit-py-plugins 0.3.1 develop
- mdurl 0.1.2 develop
- mergedeep 1.3.4 develop
- mistune 0.8.4 develop
- mkdocs 1.4.1 develop
- mkdocs-autorefs 0.4.1 develop
- mkdocs-jupyter 0.22.0 develop
- mkdocs-material 8.5.7 develop
- mkdocs-material-extensions 1.1 develop
- mkdocs-section-index 0.3.4 develop
- mkdocstrings 0.19.0 develop
- more-itertools 9.0.0 develop
- nbclient 0.7.0 develop
- nbconvert 6.5.4 develop
- nbformat 5.7.0 develop
- nest-asyncio 1.5.6 develop
- packaging 21.3 develop
- pandocfilters 1.5.0 develop
- parso 0.8.3 develop
- pexpect 4.8.0 develop
- pickleshare 0.7.5 develop
- pluggy 0.13.1 develop
- prompt-toolkit 3.0.31 develop
- psutil 5.9.3 develop
- ptyprocess 0.7.0 develop
- pure-eval 0.2.2 develop
- py 1.11.0 develop
- pycparser 2.21 develop
- pygments 2.13.0 develop
- pymdown-extensions 9.7 develop
- pyparsing 3.0.9 develop
- pyrsistent 0.18.1 develop
- pytest 5.4.3 develop
- pywin32 304 develop
- pyyaml 6.0 develop
- pyyaml-env-tag 0.1 develop
- pyzmq 24.0.1 develop
- requests 2.28.1 develop
- soupsieve 2.3.2.post1 develop
- stack-data 0.5.1 develop
- tinycss2 1.2.1 develop
- toml 0.10.2 develop
- tornado 6.2 develop
- traitlets 5.5.0 develop
- urllib3 1.26.12 develop
- watchdog 2.1.9 develop
- wcwidth 0.2.5 develop
- webencodings 0.5.1 develop
- zipp 3.10.0 develop
- numpy 1.23.4
- pandas 1.5.1
- pybedtools 0.9.0
- pysam 0.20.0
- python-dateutil 2.8.2
- pytz 2022.5
- six 1.16.0
- ipykernel ^6.16.2 develop
- mkdocs ^1.4.1 develop
- mkdocs-autorefs ^0.4.1 develop
- mkdocs-jupyter ^0.22.0 develop
- mkdocs-section-index ^0.3.4 develop
- mkdocstrings ^0.19.0 develop
- pytest ^5.2 develop
- pandas ^1.5.1
- pybedtools ^0.9.0
- python ^3.9
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- python 3.9 build
- python 3.9-alpine build