plncpro

A machine learning model for the prediction of lncRNAs (Singh et al., NAR 2017)

https://github.com/urmi-21/plncpro

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
○
codemeta.json file
○
.zenodo.json file
✓
DOI references
Found 1 DOI reference(s) in README
✓
Academic publication links
Links to: ncbi.nlm.nih.gov
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (14.8%) to scientific vocabulary

Keywords

bioinformatics lncrna prediction random-forest

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: urmi-21
License: other
Language: Python
Default Branch: master
Homepage: http://ccbb.jnu.ac.in/plncpro/
Size: 43.5 MB

Statistics

Stars: 8
Watchers: 1
Forks: 5
Open Issues: 2
Releases: 0

Created almost 9 years ago · Last pushed about 4 years ago

Metadata Files

Readme License

README.md

                      _____  _            _____  _____   ____  
                     |  __ \| |          |  __ \|  __ \ / __ \ 
                     | |__) | |_ __   ___| |__) | |__) | |  | |
                     |  ___/| | '_ \ / __|  ___/|  _  /| |  | |
                     | |    | | | | | (__| |    | | \ \| |__| |
                     |_|    |_|_| |_|\___|_|    |_|  \_\\____/

INTRODUCTION

PlncPRO (Plant Long Non-Coding rna Prediction by Random fOrests) is a program that classifies transcripts as coding (mRNAs) or long non-coding (lncRNAs). The method is based on random forests and uses features derived from protein homology searches, sequence properties, and 3-mer frequencies. We have developed predictive models for several plant species. We comprehensively tested the method on plants and vertebrates and found that it performs better than existing tools.
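To make the 3-mer frequency features concrete, here is a minimal sketch of how such a feature vector could be computed. It is an illustration only, not PlncPRO's own implementation; the function name `kmer_frequencies` and the handling of ambiguous bases are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return relative frequencies of all 4**k DNA k-mers in a sequence.

    Hypothetical helper for illustration; not part of the plncpro package.
    """
    seq = sequence.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made of A/C/G/T contribute; windows with N etc. are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


features = kmer_frequencies("ATGGCGATGAAA")
print(len(features))    # 64 features, one per possible 3-mer
print(features["ATG"])  # ATG occurs in 2 of the 10 windows
```

A vector like this, combined with homology and sequence-based features, is the kind of input a random forest classifier can be trained on.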

Citation

Singh et al. PLncPRO for prediction of long non-coding RNAs (lncRNAs) in plants and its application for discovery of abiotic stress-responsive lncRNAs in rice and chickpea. Nucleic Acids Res. 2017 Dec 15;45(22):e183. doi: 10.1093/nar/gkx866.

NOTE: We have updated PlncPRO for Python 3. PlncPRO for Python 2 is still available at http://ccbb.jnu.ac.in/plncpro/. Usage of this newer version differs from that of the older versions.

INSTALLATION

Prerequisites:

OS: Linux, macOS
Python 3.6 or later versions (http://www.python.org/)
NCBI BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
GNU C Library (glibc >= 2.14)

Python dependencies

NumPy (http://www.numpy.org/)
SciPy (https://www.scipy.org/)
Scikit-learn (http://scikit-learn.org/)
Biopython (http://biopython.org/)
regex
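Before installing, it can help to confirm that the prerequisites above are present. The following is a small sketch using only the standard library; the function name `check_prerequisites` is hypothetical, and `blastx` is assumed to be the BLAST executable of interest.

```python
import importlib.util
import shutil
import sys

# Import names for the Python dependencies listed above
# (scikit-learn imports as "sklearn", Biopython as "Bio").
PYTHON_MODULES = ["numpy", "scipy", "sklearn", "Bio", "regex"]


def check_prerequisites(blast_program="blastx"):
    """Return a dict mapping each prerequisite to True (found) or False."""
    status = {"python>=3.6": sys.version_info >= (3, 6)}
    # shutil.which returns None when the executable is not on PATH.
    status[blast_program] = shutil.which(blast_program) is not None
    for module in PYTHON_MODULES:
        # find_spec returns None for a top-level module that is not installed.
        status[module] = importlib.util.find_spec(module) is not None
    return status


for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

This only checks that each item can be located; it does not verify versions of BLAST or of the Python packages.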

Install through bioconda

conda install -c bioconda plncpro

Using PIP

pip install plncpro

From source

git clone https://github.com/urmi-21/PLncPRO.git
pip install ./PLncPRO

Run tests

bash tests/local_test.sh

Basic Usage

See the examples for detailed usage.

`plncpro predict`

Label lncRNAs and mRNAs. This command reads an input file containing sequences and classifies each sequence as coding or non-coding. It uses a model generated by `plncpro build` to make the classifications, and writes a file containing the class label and class probabilities for each sequence.

plncpro predict -i <input fasta> -o <output_dir> -p <output_file_name> -t 2 -d <blast_db> -m <model_file>

PARAMETERS

-p,--prediction_out    output file name
-i,--infile            file containing input sequences
-m,--model             model file
-o,--outdir            output directory name
-d,--db                path to blast database

OPTIONAL
-t,--threads           number of threads [default: 4]
-l,--labels            path to the file containing labels (outputs classification accuracy)
-r,--remove_temp       clean up intermediate files
-v,--verbose           show more messages
--min_len              specify min_length to filter input files
--noblast              Don't use blast features
-no_ff                 Don't use framefinder features
--qcov_hsp             specify query coverage parameter for blast [default: 30]
--blastres*            path to blast output for input file

*blast result should be in the following format:
-outfmt '6 qseqid sseqid pident evalue qcovs qcovhsp score bitscore qframe sframe'
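When `plncpro predict` is called from a pipeline script, the documented flags can be assembled programmatically. The sketch below only builds the argument list; the helper name `build_predict_command` and all file names are hypothetical, and only flags listed above are used.

```python
import shlex


def build_predict_command(infile, outdir, prediction_out, model,
                          db=None, blastres=None, threads=4):
    """Assemble an argument list for `plncpro predict` (illustrative helper)."""
    cmd = ["plncpro", "predict",
           "-i", infile,
           "-o", outdir,
           "-p", prediction_out,
           "-m", model,
           "-t", str(threads)]
    if db is not None:
        cmd += ["-d", db]                # path to a blast database
    if blastres is not None:
        cmd += ["--blastres", blastres]  # precomputed tabular blast output
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_predict_command("transcripts.fa", "plncpro_out",
                            "predictions.txt", "my.model", db="protein_db")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once plncpro is installed; passing a list rather than a shell string avoids quoting problems with unusual file names.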

`plncpro build`

Build a model from the given training data (mRNA/lncRNA transcripts). This command reads two labelled datasets containing coding and non-coding transcripts. It then trains a random forest classification model and saves it, so that it can be used to classify unknown sequences.

plncpro build -p <mrna fasta> -n <lncrna fasta> -o <out_dir> -m <model_name> -d <blast db> -t <threads>

PARAMETERS

```
-p,--pos          file containing mRNA sequences
-n,--neg          file containing lncRNA sequences
-m,--model        output model name
-o,--outdir       output directory name
-d                path to blast database

OPTIONAL
-t,--threads      number of threads [default: 4]
-k,--numtrees     number of trees [default: 1000]
-r,--removetemp   clean up intermediate files
-v,--verbose      show more messages
--minlen          specify minlength to filter input files
--noblast         Don't use blast features
--noff            Don't use framefinder features
--qcovhsp         specify query coverage parameter for blast [default: 30]
--posblastres*    path to blast result for mRNA input file
--negblastres*    path to blast result for lncRNA input file

*blast result should be in the following format:
-outfmt '6 qseqid sseqid pident evalue qcovs qcovhsp score bitscore qframe sframe'
```

`plncpro predtoseq`

Extract mRNA or lncRNA sequences from a PLNCPRO output file. This command reads a prediction output file and extracts the sequences belonging to a given class. The user can specify the class and a probability cutoff to extract the desired transcript sequences.

plncpro predtoseq -f <fasta_file> -o <outputfile> -p <PLNCPRO_prediction_file> -l <required_label>

PARAMETERS

-f      input fasta file name
-o      output fasta file name
-p      path to file containing predictions by PLNCPRO

OPTIONAL
-l      label of the required sequences (0 for lncRNA; 1 for mRNA) [default: 0]
-s      class probability cutoff (extract sequences with probability greater than or equal to s)
--min   specify min_length of sequences [default: 0]
--max   specify max_length of sequences [default: Inf]
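The selection logic described by these parameters can be sketched in a few lines of Python. This is not the tool's code: the in-memory layout of the predictions (id, label, class probability), the inclusive length bounds, and the name `select_sequences` are all assumptions made for illustration.

```python
import math


def select_sequences(sequences, predictions, label=0, cutoff=0.0,
                     min_len=0, max_len=math.inf):
    """Return {id: sequence} for entries matching label, cutoff and length.

    sequences:   dict mapping sequence id -> sequence string
    predictions: iterable of (sequence id, label, class probability) tuples;
                 this layout is assumed for the sketch, not taken from PLNCPRO.
    """
    selected = {}
    for seq_id, pred_label, prob in predictions:
        seq = sequences.get(seq_id)
        if seq is None or pred_label != label:
            continue
        # Keep sequences at or above the cutoff and within the length bounds.
        if prob >= cutoff and min_len <= len(seq) <= max_len:
            selected[seq_id] = seq
    return selected


# Toy data, invented for the example.
sequences = {"t1": "ATGGCGTAA", "t2": "GGCCTTAAGGCC", "t3": "ATGC"}
predictions = [("t1", 1, 0.93), ("t2", 0, 0.81), ("t3", 0, 0.55)]
lncrnas = select_sequences(sequences, predictions, label=0, cutoff=0.8)
print(sorted(lncrnas))  # ['t2']
```

Here label 0 stands for lncRNA and 1 for mRNA, matching the `-l` option above.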

Using Diamond instead of blastx

Diamond is several-fold faster than blastx and can be used in its place. To use Diamond with plncpro, first run Diamond with the following output parameters:

diamond blastx -d <diamondDB> -q <query.fasta> -o <diamond_out> --outfmt 6 qseqid sseqid pident evalue nident qcovhsp score bitscore qframe qstrand

Then pass the Diamond output to plncpro using the --blastres parameter, e.g.:

plncpro predict -i <input.fasta> -o <outfile> -p <preds> --blastres <diamond_out> -m <model>
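Because plncpro expects the tabular result in a fixed column order, it can be worth checking a Diamond output file before passing it on. The following sketch reads a tab-separated file and verifies the column count; the column names are copied from the `--outfmt` string above, while the function name `read_hits` and the example row are invented for illustration.

```python
import csv
import io

# Column names taken from the Diamond --outfmt string above.
DIAMOND_FIELDS = ["qseqid", "sseqid", "pident", "evalue", "nident",
                  "qcovhsp", "score", "bitscore", "qframe", "qstrand"]


def read_hits(handle, fields=DIAMOND_FIELDS):
    """Yield one dict per hit; raise ValueError on an unexpected column count."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(fields):
            raise ValueError(
                f"line {line_no}: expected {len(fields)} columns, got {len(row)}"
            )
        yield dict(zip(fields, row))


# A made-up hit line standing in for a real Diamond output file.
example = io.StringIO("tx1\tP12345\t87.5\t1e-30\t70\t62.0\t310\t124.0\t2\t+\n")
hits = list(read_hits(example))
print(hits[0]["qseqid"], hits[0]["qframe"])  # tx1 2
```

For a real file, replace the `io.StringIO` object with `open("<diamond_out>", newline="")`. All values are kept as strings; convert them as needed.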

Download data used in paper

Data is hosted on Google Drive. Direct link

Download directly using wget:

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=108S-9Bt4CLCHTaCn6-HKTqQZDo0nssZe' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=108S-9Bt4CLCHTaCn6-HKTqQZDo0nssZe" -O plncpro_data.zip && rm -rf /tmp/cookies.txt

COPYING

GNU General Public License version 3 (GPLv3). Details at http://www.gnu.org/copyleft/gpl.html

Owner

Name: Urminder Singh
Login: urmi-21
Kind: user

Website: https://urmi-21.github.io/
Repositories: 48
Profile: https://github.com/urmi-21

Bioinformatics Scientist

Committers

Last synced: about 2 years ago

All Time

Total Commits: 86
Total Committers: 2
Avg Commits per committer: 43.0
Development Distribution Score (DDS): 0.023

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
urmi-21	m**1@g**m	84
usingh	u**h@l**n	2

Committer Domains (Top 20 + Academic)

localhost.localdomain: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 2
Total pull requests: 1
Average time to close issues: about 1 month
Average time to close pull requests: N/A
Total issue authors: 2
Total pull request authors: 1
Average comments per issue: 0.5
Average comments per pull request: 3.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

jamesonypy (1)
zhuqingquan5510 (1)

Pull Request Authors

nicolasDelhomme (1)

Packages

Total packages: 1
Total downloads:
- pypi 12 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 3
Total maintainers: 1

pypi.org: plncpro

PlncPRO (Plant Long Non-Coding rna Prediction by Random fOrests) is a program to classify coding (mRNAs) and long non-coding transcripts (lncRNAs).

Homepage: https://github.com/urmi-21/PLncPRO
Documentation: https://plncpro.readthedocs.io/
License: MIT License
Latest release: 1.2.2
published about 6 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 12 Last month

Rankings

Dependent packages count: 10.1%

Forks count: 14.2%

Stargazers count: 20.4%

Average: 21.4%

Dependent repos count: 21.6%

Downloads: 40.9%

Maintainers (1)

urmi21

Last synced: 6 months ago

Dependencies

requirements.txt pypi

biopython *
regex *
sklearn *
