isocon

Derives consensus sequences from a set of long noisy reads by clustering and error correction.

https://github.com/ksahlin/isocon

Science Score: 20.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links
  Links to: nature.com
✓ Committers with academic emails
  1 of 3 committers (33.3%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity
  Low similarity (13.5%) to scientific vocabulary

Keywords

bioinformatics ccs clustering error-correction long-reads transcriptome

Last synced: 6 months ago · JSON representation

Repository

Derives consensus sequences from a set of long noisy reads by clustering and error correction.

Basic Info

Host: GitHub
Owner: ksahlin
License: gpl-3.0
Language: Python
Default Branch: master
Homepage:
Size: 1.15 MB

Statistics

Stars: 16
Watchers: 4
Forks: 1
Open Issues: 6
Releases: 0

Topics

bioinformatics ccs clustering error-correction long-reads transcriptome

Created about 9 years ago · Last pushed over 4 years ago

Metadata Files

Readme Changelog License

IsoCon

IsoCon is distributed as a Python package and is supported on Linux and OSX with Python 2.7 and 3.6-3.8.

IsoCon is a tool for reconstructing highly similar sequences present in a dataset of long noisy reads. Its original use case was transcripts from highly similar gene copies (paper here); however, the methodology extends to any dataset where the sequences span the region(s) of interest end-to-end. Example uses of IsoCon:

Deriving finished transcripts from Iso-Seq or ONT reads from targeted sequencing of gene families using primers.
Deriving a consensus sequence from several passes of long noisy reads (e.g., PacBio polymerase reads to CCS or ONT Rolling Circle Amplification to Concatemeric Consensus (R2C2)).
Deriving viral strains from reads (assuming the reads span the viral sequence, e.g., as for HIV).
Deriving consensus ribosomal RNA.
Deriving a consensus from any targeted amplicon-based sequencing technique.

The simplest usage takes a single FASTQ or FASTA file of reads as input. IsoCon can be run as follows:

IsoCon pipeline -fl_reads <reads.fastq> -outfolder </path/to/output>

IsoCon pipeline -fl_reads <reads.fasta> -outfolder </path/to/output> --ccs </path/to/filename.ccs.bam>

Predicted transcripts are written to /path/to/output/final_candidates.fa. Reads that could not be corrected or clustered are written to /path/to/output/not_converged.fa.

Can IsoCon be run on nontargeted Iso-Seq datasets? See here.
How does my dataset affect the runtime? See here.

For more instructions see below.

Table of Contents
INSTALLATION
USAGE
CREDITS
LICENCE

INSTALLATION

Using conda

Conda is the preferred way to install IsoCon.

Create and activate a new environment called IsoCon

conda create -n IsoCon python=3.8 pip
conda activate IsoCon

Install IsoCon

pip install IsoCon

You should now have 'IsoCon' installed; try it:

IsoCon --help

Upon start/login to your server/computer you need to activate the conda environment "IsoCon" to run IsoCon as:

conda activate IsoCon

Using pip

pip is Python's official package installer. This section assumes you have Python (v2.7 or >=3.6) and a recent version of pip, which is included with most Python versions. If you do not have pip, it can be installed from here and upgraded with pip install --upgrade pip.

With Python and pip available, create a file requirements.txt with contents copied from this file. Then type in the terminal:

pip install --requirement requirements.txt IsoCon

This should install IsoCon. With proper installation of IsoCon, you should be able to issue the command IsoCon pipeline to view user instructions. You should also be able to run IsoCon on this small dataset. Simply download the test dataset and run:

IsoCon pipeline -fl_reads [path/simulated_pacbio_reads.fa] -outfolder [output path]

pip will install the dependencies automatically. IsoCon has been built with Python 2.7 and 3.4-3.6 on Linux systems using Travis. For a customized installation of the latest master branch, see below.

Downloading source from GitHub

Dependencies

Make sure the dependencies listed below are installed (installation links below). The versions in parentheses are suggested because IsoCon has not been tested with earlier versions of these libraries; however, IsoCon may also work with earlier versions.

* edlib, for installation see link (>= v1.1.2)
* networkx (>= v1.10)
* parasail
* pysam (>= v0.11)

With these dependencies installed, run:

git clone https://github.com/ksahlin/IsoCon.git
cd IsoCon
./IsoCon
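The suggested minimum versions in the dependency list can be checked before running from source. The following is a minimal sketch, not part of IsoCon: the helper names (`version_tuple`, `meets_minimum`, `check_dependencies`) are hypothetical, and it assumes the libraries were installed under the distribution names edlib, networkx, parasail and pysam, and that Python 3.8 or later is used (for `importlib.metadata`).

```python
import re

# Suggested minimum versions from the dependency list; parasail has none listed.
SUGGESTED = {"edlib": "1.1.2", "networkx": "1.10", "pysam": "0.11", "parasail": None}


def version_tuple(version):
    """Turn '1.10.2' into (1, 10, 2); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the suggested minimum."""
    if minimum is None:
        return True
    return version_tuple(installed) >= version_tuple(minimum)


def check_dependencies(suggested=SUGGESTED):
    """Report 'ok', 'older than suggested' or 'missing' for each dependency."""
    from importlib import metadata  # available in Python >= 3.8

    report = {}
    for name, minimum in suggested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "older than suggested"
    return report


if __name__ == "__main__":
    for name, status in sorted(check_dependencies().items()):
        print(name, status)
```

Versions are compared as integer tuples rather than strings, because a plain string comparison would rank "1.9" above "1.10".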

USAGE

IsoCon's algorithm consists of two main phases: the error correction step and the statistical testing step. IsoCon can run these two steps in one go using IsoCon pipeline, or it can run only the correction or statistical test steps using IsoCon get_candidates and IsoCon stat_filter respectively. The preferred and most tested way is to run the entire pipeline with IsoCon pipeline, but the other two settings can come in handy in specific cases. For example, running only IsoCon get_candidates is faster and gives more sequences if precision is not a concern, while running only IsoCon stat_filter with different parameters on a set of already constructed candidates avoids rerunning the error correction step.
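When IsoCon is driven from a script, the three subcommands can be assembled in one place. The sketch below is an illustration only: `build_isocon_command` is a hypothetical helper, it uses only the flags that appear in this README (-fl_reads, -outfolder, -candidates, --ccs), and it assumes that stat_filter always needs a candidates file.

```python
import shlex

# Subcommands named in the USAGE text.
SUBCOMMANDS = ("pipeline", "get_candidates", "stat_filter")


def build_isocon_command(subcommand, reads, outfolder, candidates=None, ccs=None):
    """Return an argv list for one IsoCon subcommand (hypothetical helper)."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError("unknown subcommand: %s" % subcommand)
    # Assumption: stat_filter is only meaningful with a candidates file.
    if subcommand == "stat_filter" and candidates is None:
        raise ValueError("stat_filter needs a candidates file")
    cmd = ["IsoCon", subcommand, "-fl_reads", reads, "-outfolder", outfolder]
    if candidates is not None:
        cmd += ["-candidates", candidates]
    # The README shows --ccs only together with the pipeline subcommand.
    if ccs is not None:
        cmd += ["--ccs", ccs]
    return cmd


if __name__ == "__main__":
    cmd = build_isocon_command("pipeline", "reads.fastq", "out")
    print(" ".join(shlex.quote(part) for part in cmd))
    # To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Building an argument list instead of a shell string avoids quoting problems when paths contain spaces.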

Pipeline

Using quality values (fastq) is preferred over fasta as IsoCon uses the quality values for statistical analysis.

IsoCon pipeline -fl_reads <reads.fast[a/q]> -outfolder </path/to/output>
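To see what the quality values in a FASTQ file carry, the sketch below converts a quality string into per-base error probabilities using the standard Phred relation p = 10^(-Q/10). It assumes the common Phred+33 ASCII encoding and says nothing about how IsoCon itself weighs these values in its statistical test; the function names are made up for this example.

```python
def phred_to_error_probs(quality_string, offset=33):
    """Convert a FASTQ quality string to per-base error probabilities.

    Assumes Phred+offset ASCII encoding (offset 33 is the common default).
    """
    return [10 ** (-(ord(ch) - offset) / 10.0) for ch in quality_string]


def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities, i.e. the expected number of errors."""
    return sum(phred_to_error_probs(quality_string, offset))


if __name__ == "__main__":
    quals = "II5+"  # Q40, Q40, Q20, Q10
    print(phred_to_error_probs(quals))
    print(expected_errors(quals))
```

A FASTA file has no such per-base information, so any method that relies on it has to fall back on a uniform error assumption.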

Output

The final high-quality transcripts are written to the file final_candidates.fa in the output folder. If only one or two reads come from a transcript that is sufficiently different from the other reads (an exon difference), they are written to the file not_converged.fa. This file may also contain erroneous reads such as chimeras. The output also contains a file cluster_info.tsv that shows, for each read, which candidate in final_candidates.fa it was assigned to.
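The output files can be summarized with a few lines of Python. This is a sketch under stated assumptions: the FASTA reader is generic, but the column layout of cluster_info.tsv is not documented here, so `reads_per_candidate` assumes a tab-separated table with the read identifier in the first column and the candidate identifier in the second. Adjust `read_col` and `candidate_col` after inspecting a real file.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reads_per_candidate(handle, read_col=0, candidate_col=1):
    """Count reads assigned to each candidate.

    ASSUMPTION: tab-separated rows with read id and candidate id in the
    given columns; the real cluster_info.tsv layout may differ.
    """
    counts = Counter()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(read_col, candidate_col):
            continue  # skip blank or short lines
        counts[fields[candidate_col]] += 1
    return counts


if __name__ == "__main__":
    # In-memory stand-ins for final_candidates.fa and cluster_info.tsv.
    fasta = io.StringIO(">cand_1\nACGT\nAC\n>cand_2\nTTGA\n")
    table = io.StringIO("read_a\tcand_1\nread_b\tcand_1\nread_c\tcand_2\n")
    for name, seq in read_fasta(fasta):
        print(name, len(seq))
    print(reads_per_candidate(table).most_common())
```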

get_candidates

Runs only the error correction step. The output is the converged candidates in a fasta file.

IsoCon get_candidates -fl_reads <flnc.fast[a/q]> -outfolder </path/to/output>

stat_filter

Runs only the statistical filtering of candidates.

IsoCon stat_filter -fl_reads <flnc.fast[a/q]> -outfolder </path/to/output> -candidates <candidate_transcripts.fa>

Observe that candidate_transcripts.fa does not have to come from IsoCon's error correction algorithm. For example, it could be a set of already validated transcripts that one would like to check for in the reads, or it could be CCS reads corrected with Illumina data (or by other means).

CREDITS

Please cite [1] when using IsoCon.

[1] Kristoffer Sahlin, Marta Tomaszkiewicz, Kateryna D. Makova†, Paul Medvedev†. Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon. Nature Communications, 9(1):4601, 2018. Link.

LICENCE

GPL v3.0, see LICENSE.txt.

Owner

Name: Kristoffer
Login: ksahlin
Kind: user

Website: http://sahlingroup.github.io/
Repositories: 26
Profile: https://github.com/ksahlin

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Committers

Last synced: about 2 years ago

All Time

Total Commits: 387
Total Committers: 3
Avg Commits per committer: 129.0
Development Distribution Score (DDS): 0.008

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Kristoffer Sahlin	k**4@p**u	384
Kristoffer	k**n@g**m	2
pashadag	p**g@g**m	1
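The committer figures above can be reproduced from the commit counts in the table. The sketch below assumes the commonly used definition of the Development Distribution Score, DDS = 1 - (commits by the top committer / total commits); the page itself does not state its formula, and the function name is chosen for this example.

```python
def development_distribution_score(commit_counts):
    """Return 1 - (top committer's share of commits); 0.0 if there are no commits.

    ASSUMPTION: this is the usual DDS definition, not one confirmed by this page.
    """
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return 1.0 - max(commit_counts) / float(total)


if __name__ == "__main__":
    commits = [384, 2, 1]  # from the Top Committers table
    print(sum(commits) / float(len(commits)))  # average commits per committer
    print(round(development_distribution_score(commits), 3))
```

With 384 of 387 commits from one committer, the score is close to zero, which reflects a project maintained almost entirely by one person.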

Committer Domains (Top 20 + Academic)

psu.edu: 1

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 9
Total pull requests: 0
Average time to close issues: about 22 hours
Average time to close pull requests: N/A
Total issue authors: 6
Total pull request authors: 0
Average comments per issue: 5.33
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

ksahlin (4)
uqvirg (1)
nextgenusfs (1)
wyim-pgl (1)
edgardomortiz (1)
wheatwill (1)

Pull Request Authors

Top Labels

Issue Labels

question (3)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 12 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 6
Total maintainers: 1

pypi.org: isocon

Pipeline for obtaining non-redundant haplotype specific transcript isoforms using PacBio IsoSeq reads.

Homepage: https://github.com/ksahlin/IsoCon
Documentation: https://isocon.readthedocs.io/
License: gpl-3.0
Latest release: 0.3.3
published almost 6 years ago

Versions: 6
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 12 Last month

Rankings

Dependent packages count: 7.3%

Stargazers count: 15.6%

Average: 19.4%

Dependent repos count: 22.1%

Forks count: 22.7%

Downloads: 29.4%

Maintainers (1)

ksahlin

Last synced: 6 months ago
