shmlast
shmlast: An improved implementation of Conditional Reciprocal Best Hits with LAST and Python - Published in JOSS (2017)
Science Score: 93.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 2 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: joss.theoj.org, zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
blast, shmlast
Statistics
- Stars: 22
- Watchers: 2
- Forks: 8
- Open Issues: 8
- Releases: 6
Metadata Files
README.md
shmlast
An improved implementation of Conditional Reciprocal Best Hits with LAST and Python
shmlast is a reimplementation of the Conditional Reciprocal Best Hits algorithm for finding potential orthologs between a transcriptome and a species-specific protein database. It uses the LAST aligner and the pydata stack to achieve much better performance while staying in the Python ecosystem.
About
Conditional Reciprocal Best Hits (CRBH) was originally described by Aubry et al. 2014 and implemented in the crb-blast package. CRBH builds on the traditional Reciprocal Best Hits (RBH) method for orthology assignment by fitting a simple model of the e-value cutoff as a function of sequence length to an initial set of RBHs. From the crb-blast GitHub repository:
"Reciprocal best BLAST is a very conservative way to assign orthologs. The main innovation in
CRB-BLAST is to learn an appropriate e-value cutoff to apply to each pairwise alignment by taking
into account the overall relatedness of the two datasets being compared. This is done by fitting a
function to the distribution of alignment e-values over sequence lengths. The function provides the
e-value cutoff for a sequence of given length."
Unfortunately, the original implementation uses NCBI BLAST+ (which is incredibly slow) and is written in Ruby, which requires users to leave the Python-dominated bioinformatics software ecosystem. shmlast makes this algorithm available to users in Python-land, while also greatly improving performance by using LAST for initial homology searches. Additionally, shmlast outputs both the raw parameters and a plot of its model for inspection.
shmlast is designed for finding orthologs between transcriptomes and protein databases. As such, it currently does not support nucleotide-nucleotide or protein-protein alignments. This may be changed in a future version, but for now, it remains focused on that task.
Also note that RBH, and by extension CRBH, is meant for comparing between two species. Neither of these methods should be used for annotating a transcriptome with a mixed protein database (like, for example, uniref90).
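As a rough illustration of the RBH step that CRBH builds on, the sketch below finds reciprocal best hits from two toy hit tables with pandas. The column names (`q_name`, `s_name`, `E`) are borrowed from shmlast's output, but the data and the `best_hits` and `reciprocal_best_hits` helpers are hypothetical and are not shmlast's own implementation.

```python
import pandas as pd


def best_hits(hits, key):
    """Keep the lowest e-value hit for each value in the `key` column."""
    # Stable sort by e-value so the first row per key is the best hit.
    ordered = hits.sort_values("E", kind="mergesort")
    return ordered.drop_duplicates(subset=key, keep="first")


def reciprocal_best_hits(forward, reverse):
    """Pairs that are each other's best hit in both search directions.

    Both tables use q_name for the transcript and s_name for the protein;
    `forward` holds transcript-vs-protein hits, `reverse` the opposite search.
    """
    fwd = best_hits(forward, "q_name")[["q_name", "s_name", "E"]]
    rev = best_hits(reverse, "s_name")[["q_name", "s_name"]]
    # An inner merge on both names keeps only the reciprocal pairs.
    return fwd.merge(rev, on=["q_name", "s_name"]).reset_index(drop=True)


# Made-up hits, for illustration only.
forward = pd.DataFrame({
    "q_name": ["t1", "t1", "t2", "t3"],
    "s_name": ["pA", "pB", "pB", "pB"],
    "E": [1e-50, 1e-10, 1e-30, 1e-5],
})
reverse = pd.DataFrame({
    "q_name": ["t1", "t2", "t3"],
    "s_name": ["pA", "pB", "pB"],
    "E": [1e-48, 1e-29, 1e-4],
})

rbh = reciprocal_best_hits(forward, reverse)
```

Here `t3` has `pB` as its best hit, but `pB`'s best hit is `t2`, so `t3` is dropped. CRBH would then use the surviving pairs to fit its length-dependent e-value cutoff.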
Usage
For some transcriptome transcripts.fa and some protein database pep.faa, the basic usage is:
```bash
shmlast crbl -q transcripts.fa -d pep.faa
```
shmlast can be distributed across multiple cores using the --n_threads option.
```bash
shmlast crbl -q transcripts.fa -d pep.faa --n_threads 8
```
Another use case is to perform simple Reciprocal Best Hits; this can be done with the rbl
subcommand. The maximum expectation-value can also be specified with -e.
```bash
shmlast rbl -q transcripts.fa -d pep.faa -e 0.000001
```
Output
shmlast outputs a plain CSV file with the CRBHs, which by default will be named $QUERY.x.$DATABASE.crbl.csv. This CSV file can be easily parsed with Pandas like so:
```python
import pandas as pd

crbl_df = pd.read_csv('query.x.database.crbl.csv')
```
The columns are:
- E: The e-value.
- EG2: Expected alignments per square gigabase.
- E_scaled: E-value rescaled for the model (see below for details).
- ID: A unique ID for the alignment.
- bitscore: The bitscore, calculated as (lambda * score - ln[K]) / ln[2].
- q_aln_len: Query alignment length.
- q_frame: Frame in the query translation.
- q_len: Length of the query sequence.
- q_name: Name of the query sequence.
- q_start: Start of query alignment.
- q_strand: Strand of query alignment.
- s_aln_len: Length of subject alignment.
- s_len: Length of subject sequence.
- s_name: Name of subject sequence.
- s_start: Start of subject alignment.
- s_strand: Strand of subject alignment.
- score: The alignment score.
See http://last.cbrc.jp/doc/last-evalues.html for more information on e-values and scores.
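The bitscore formula from the column list can be written as a small function. This is only a sketch of the arithmetic: the `lam` and `K` values used below are placeholders chosen to make the result easy to check, not parameters from any real LAST run.

```python
import math


def bitscore(score, lam, K):
    """Bitscore as given in the column list: (lambda * score - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


# Hypothetical parameter values, for illustration only.
example = bitscore(score=100, lam=math.log(2), K=1.0)
```

With `lam = ln 2` and `K = 1` the bitscore equals the raw score, which makes a convenient sanity check for the function.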
Model Output
shmlast also outputs its model, both in CSV format and as a plot. The CSV file is named
$QUERY.x.$DATABASE.crbl.model.csv, and has the following columns:
- center: The center of the length bin.
- size: The size of the bin.
- left: The left of the bin.
- right: The right of the bin.
- fit: The scaled e-value cutoff for the bin.
To fit the model, the e-values are first scaled to a more suitable range using the equation
Es = -log10(E), where Es is the scaled e-value. e-values of 0 are set to an arbitrarily small
value to allow for log-scaling. The fit column of the model is this scaled value.
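The scaling step can be sketched with NumPy as follows. The text only says that zero e-values are set to "an arbitrarily small value", so the `E_FLOOR` constant here is an assumed placeholder and not necessarily the value shmlast uses; `scale_evalues` is likewise a hypothetical helper name.

```python
import numpy as np

# Assumption: placeholder floor for e-values of exactly 0.
# This is not taken from shmlast's source.
E_FLOOR = 1e-300


def scale_evalues(evalues, floor=E_FLOOR):
    """Return Es = -log10(E), replacing e-values of 0 with `floor` first."""
    e = np.asarray(evalues, dtype=float)
    e = np.where(e == 0.0, floor, e)
    return -np.log10(e)


scaled = scale_evalues([1e-5, 1e-50, 0.0])
```

Smaller e-values map to larger scaled values, so a stronger hit has a higher `E_scaled`.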
The model plot is named $QUERY.x.$DATABASE.crbl.model.plot.pdf by default.
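One way to inspect the model CSV is to look up the cutoff for a given length and compare a hit's `E_scaled` against it. The sketch below uses a made-up in-memory table in place of the real file. The half-open bin convention, the choice of which length to look up, and the `cutoff_for_length` and `passes_cutoff` helpers are all assumptions for illustration rather than shmlast's actual filtering code.

```python
import pandas as pd

# Toy stand-in for $QUERY.x.$DATABASE.crbl.model.csv; the values are made up.
model = pd.DataFrame({
    "center": [50.0, 150.0, 250.0],
    "size": [12, 30, 8],
    "left": [0.0, 100.0, 200.0],
    "right": [100.0, 200.0, 300.0],
    "fit": [20.0, 12.0, 6.0],
})


def cutoff_for_length(model, length):
    """Return the scaled e-value cutoff of the bin containing `length`.

    Assumes half-open bins [left, right); returns None outside all bins.
    """
    in_bin = (model["left"] <= length) & (length < model["right"])
    matches = model.loc[in_bin, "fit"]
    return float(matches.iloc[0]) if not matches.empty else None


def passes_cutoff(model, length, e_scaled):
    """True when a hit's scaled e-value is at least its length bin's cutoff."""
    cutoff = cutoff_for_length(model, length)
    return cutoff is not None and e_scaled >= cutoff
```

For a real run, the toy table would be replaced by `pd.read_csv(...)` on the model CSV file.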
Installation
via Conda
conda is the preferred installation method. shmlast is hosted on bioconda and it can be installed along with its dependencies using:
```bash
conda install shmlast -c bioconda
```
PyPI
If you really want to avoid conda, you can install via PyPI with:
```bash
pip install shmlast
```
Afterwards, you will need to install the third-party dependencies manually.
Third-party Dependencies
shmlast requires the LAST aligner and gnu-parallel. These will be installed automatically via conda if you choose that route; some other ways to install them follow.
Manually
LAST can be installed manually into your home directory like so:
```bash
cd
curl -LO http://last.cbrc.jp/last-658.zip
unzip last-658.zip
pushd last-658 && make && make install prefix=~ && popd
```
And a recent version of gnu-parallel can be installed like so:
```bash
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
```
Through a Package Manager
For Ubuntu 16.04 or newer, sufficiently new versions of both are available through the package manager:
```bash
sudo apt-get install last-align parallel
```
For OSX, you can get LAST through the homebrew-science channel:
```bash
brew tap homebrew/science
brew install last
```
Library
shmlast is also a Python library. Each component of the pipeline is implemented as a
pydoit task and can be used in doit workflows, and the implementations for calculating best hits,
reciprocal best hits, and conditional reciprocal best hits are usable as Python
classes. For example, the lastal task could be incorporated into a doit file like so:
```python
from shmlast.last import lastal_task

def task_lastal():
    return lastal_task('query.fna', 'db.faa', translate=True)
```
Known Issues
There is currently an issue with IUPAC codes in RNA. This will be fixed soon.
Contributing
See CONTRIBUTING.md for guidelines.
References
Aubry S, Kelly S, Kümpers BMC, Smith-Unna RD, Hibberd JM (2014) Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis. PLoS Genet 10(6): e1004365. doi:10.1371/journal.pgen.1004365
O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
Kiełbasa, S. M., Wan, R., Sato, K., Horton, P., & Frith, M. C. (2011). Adaptive seeds tame genomic sequence comparison. Genome research, 21(3), 487-493.
Owner
- Name: Camille Scott
- Login: camillescott
- Kind: user
- Location: Davis, CA
- Website: http://www.camillescott.org
- Repositories: 41
- Profile: https://github.com/camillescott
Sys Admin @ucdavis High Performance Compute Core Facility; formerly @dib-lab
JOSS Publication
shmlast: An improved implementation of Conditional Reciprocal Best Hits with LAST and Python
Editor
George Githinji
Tags
bioinformatics orthology alignment LAST BLAST python
CodeMeta (codemeta.json)
{
"@context": "https://raw.githubusercontent.com/mbjones/codemeta/master/codemeta.json",
"@type": "Code",
"author": [
{
"@id": "http://orcid.org/0000-0001-8822-8779",
"@type": "Person",
"email": "camille.scott.w@gmail.com",
"name": "Camille Scott",
"affiliation": "University of California, Davis"
}
],
"identifier": "",
"codeRepository": "https://github.com/camillescott/shmlast",
"datePublished": "2016-11-21",
"dateModified": "2016-11-21",
"dateCreated": "2016-11-21",
"description": "An improved implementation of Conditional Reciprocal Best Hits with LAST and Python.",
"keywords": "bioinformatics, orthology, alignment, LAST, BLAST, python",
"license": "BSD-3-Clause",
"title": "shmlast",
"version": "v1.4"
}
Papers & Mentions
Total mentions: 1
The Oyster River Protocol: a multi-assembler and kmer approach for de novo transcriptome assembly
- DOI: 10.7717/peerj.5428
- OpenAlex ID: https://openalex.org/W2951182431
- Published: August 2018
GitHub Events
Total
- Watch event: 1
- Fork event: 1
Last Year
- Watch event: 1
- Fork event: 1
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Camille Scott | c****w@g****m | 158 |
| Luiz Irber | l****r@g****m | 1 |
| C. Titus Brown | t****s@i****g | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 7
- Total pull requests: 5
- Average time to close issues: 4 months
- Average time to close pull requests: about 1 month
- Total issue authors: 6
- Total pull request authors: 4
- Average comments per issue: 0.29
- Average comments per pull request: 1.6
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ctb (2)
- camillescott (1)
- jefdaj (1)
- macmanes (1)
- hydrahamster (1)
- halexand (1)
Pull Request Authors
- ctb (2)
- johnsolk (1)
- bluegenes (1)
- luizirber (1)
Packages
- Total packages: 1
- Total downloads: 79 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 8
- Total maintainers: 1
pypi.org: shmlast
An improved implementation of Conditional Reciprocal Best Hits with LAST and Python.
- Homepage: https://github.com/camillescott/shmlast
- Documentation: https://shmlast.readthedocs.io/
- License: BSD
- Latest release: 1.2.1 (published almost 8 years ago)
Dependencies
- doit >=0.29.0
- ficus >=0.5
- filelock >=2.0.6
- matplotlib >=1.5.1
- numpy >=1.9.0
- ope >=0.6
- pandas >=0.17.0
- pytest <4
- pytest-benchmark *
- scipy >=0.16.0
- screed >=0.9
- seaborn >=0.6.0
- doit >=0.29.0 development
- ficus >=0.5 development
- filelock >=2.0.6 development
- matplotlib >=1.5.1 development
- numpy >=1.9.0 development
- ope >=0.6 development
- pandas >=0.17.0 development
- psutil * development
- pytest <4 development
- pytest-benchmark * development
- scipy >=0.16.0 development
- screed >=0.9 development
- seaborn >=0.6.0 development
- doit *
- ficus *
- filelock *
- matplotlib *
- numpy *
- ope *
- pandas *
- scipy *
- screed *
- seaborn *
