resistify

Resistify is a program which rapidly identifies and classifies plant resistance genes from protein sequences. It is designed to be lightweight and easy to use.

https://github.com/swiftseal/resistify

Last synced: 7 months ago

Repository

Basic Info

Host: GitHub
Owner: SwiftSeal
License: gpl-3.0
Language: Python
Default Branch: main
Homepage:
Size: 116 MB

Statistics

Stars: 39
Watchers: 2
Forks: 3
Open Issues: 3
Releases: 24

Created over 2 years ago · Last pushed 7 months ago

Metadata Files

Readme License Citation

# Resistify

![Conda Version](https://img.shields.io/conda/vn/bioconda/resistify) ![Conda Downloads](https://img.shields.io/conda/dn/bioconda/resistify) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/swiftseal/resistify/blob/main/assets/resistify.ipynb) [![Pixi Badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/prefix-dev/pixi/main/assets/badge/v0.json)](https://pixi.sh)

*Resistify can now make plots! Try `resistify draw`*

Resistify is a program which rapidly identifies and classifies plant resistance genes from protein sequences. It is designed to be lightweight and easy to use.

A screenshot of the help interface of resistify

Getting started

Resistify is available via the Bioconda channel:

conda create -n resistify resistify
conda activate resistify

[!NOTE] If you want to use the GPU-accelerated pipelines, conda may fail to install a GPU-ready version of pytorch. If this occurs, try installing pytorch-gpu resistify instead.

Containers are also available through the biocontainers repository. To use these with singularity, simply run:

singularity exec docker://quay.io/biocontainers/resistify:<tag-goes-here> resistify

Usage

Identifying NLRs

To predict NLRs within a set of protein sequences, simply run:

resistify nlr <input.fa> -o $RESULTS_DIR

and Resistify will identify and classify NLRs, and return some files:

- results.tsv - A table containing the primary results of Resistify.
- motifs.tsv - A table of all the NLR motifs identified for each sequence.
- domains.tsv - A table of all the domains identified for each sequence.
- annotations.tsv - A table of the raw annotations for each sequence.
- nbarc.fasta - A fasta file of all the NB-ARC domains identified.
- nlr.fasta - A fasta file of all NLRs identified.

By default, Resistify will only return sequences with NB-ARC domains. If you wish to identify highly fragmented NLRs, you can use the --retain option which will predict and report NLR-associated motifs for all sequences. It'll be a bit slower!

If you want to increase the sensitivity of coiled-coil domain annotation, you can use the option --coconat. This will use CoCoNat to predict coiled-coil domains. In practice, I wouldn't expect this mode to pick up on a significant number of missed CC domains, but it can pick up on cryptic CCs that do not have an identifiable EDVID motif.
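If Resistify is called from a script, the options above can be assembled into a command programmatically. The following is a minimal sketch: `build_nlr_command` is a hypothetical helper (not part of Resistify), the file names are placeholders, and it only uses the `nlr`, `-o`, `--retain`, and `--coconat` arguments described above.

```python
import shlex


def build_nlr_command(fasta, outdir, retain=False, coconat=False):
    """Assemble the argument list for a `resistify nlr` run (hypothetical helper)."""
    cmd = ["resistify", "nlr", fasta, "-o", outdir]
    if retain:
        cmd.append("--retain")   # also report motifs for sequences without an NB-ARC domain
    if coconat:
        cmd.append("--coconat")  # additionally predict coiled-coil domains with CoCoNat
    return cmd


# Placeholder file names; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
cmd = build_nlr_command("proteins.fa", "nlr_results", retain=True)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.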

How does it work?

Resistify carries out an initial search for common NLR domains to quickly filter and annotate the input sequences. Then, Resistify executes a re-implementation of NLRexpress to conduct a highly accurate search for NLR-associated motifs. If --coconat is used, this will also be executed to scavenge for potentially missed coiled-coil domains. Together, this evidence is used to classify NLRs according to their domain architecture.

Identifying PRRs

[!IMPORTANT] This pipeline is currently in development - due to other commitments I can't currently benchmark this properly and make no guarantees about its accuracy yet! Feedback is appreciated.

To predict PRRs within a set of protein sequences, simply run:

resistify prr <input.fa> -o $RESULTS_DIR

and Resistify will identify and classify PRRs, and return some files:

- results.tsv - A table containing the primary results of Resistify.
- motifs.tsv - A table of all the LRR motifs identified for each sequence.
- domains.tsv - A table of all the domains identified for each sequence.
- annotations.tsv - A table of the raw annotations for each sequence.
- prr.fasta - A fasta file of all PRRs identified.

[!WARNING] This pipeline is GPU-accelerated and will be slow on CPU only.

How does it work?

First, Resistify searches for domains associated with a recently described classification system for RLP/RLKs. Then, a re-implementation of TMbed is used to predict transmembrane domains - sequences with a single α-helix transmembrane domain and an extracellular domain of at least 50 amino acids are considered as RLPs. Finally, NLRexpress is used to identify LRR domains.

Sequences are classified as being either RLPs or RLKs depending on the presence of an internal kinase domain, and are classified according to their extracellular domain.
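The RLP/RLK decision described above can be paraphrased as a small rule. This is an illustrative sketch of the prose only, not Resistify's actual code: the `CandidateReceptor` class, its field names, and `receptor_type` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MIN_EXTRACELLULAR_LENGTH = 50  # amino acids, the threshold quoted in the text above


@dataclass
class CandidateReceptor:
    """Hypothetical summary of the evidence gathered for one sequence."""
    n_tm_helices: int          # number of predicted alpha-helix transmembrane domains
    extracellular_length: int  # length of the extracellular region in amino acids
    has_kinase: bool           # whether an internal kinase domain was found


def receptor_type(seq: CandidateReceptor) -> Optional[str]:
    """Return "RLK", "RLP", or None if the sequence is not a single-pass receptor."""
    if seq.n_tm_helices != 1:
        return None
    if seq.extracellular_length < MIN_EXTRACELLULAR_LENGTH:
        return None
    return "RLK" if seq.has_kinase else "RLP"


# fls2-like evidence: one helix, a long extracellular region, and a kinase domain
print(receptor_type(CandidateReceptor(1, 806, True)))
```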

Downloading model data

[!NOTE] This only applies to the --coconat and PRR pipelines! The standard NLR pipeline does not require any external databases.

By default, Resistify will automatically download models to $HOME/.cache when required. This default can be changed by setting the environment variables $HF_HOME and $TORCH_HOME to your preferred location. If you need to download the models in advance (e.g. if running Resistify as part of a pipeline) you can use the download_models utility.
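When Resistify is launched from a Python wrapper, the two environment variables can be set for the child process only. A minimal sketch, assuming a hypothetical cache root and hypothetical `huggingface` and `torch` subdirectory names (any writable paths would do):

```python
import os
from pathlib import Path


def model_cache_env(cache_root):
    """Copy the current environment and point both model caches below cache_root."""
    root = Path(cache_root)
    env = dict(os.environ)
    env["HF_HOME"] = str(root / "huggingface")  # subdirectory name is an assumption
    env["TORCH_HOME"] = str(root / "torch")     # subdirectory name is an assumption
    return env


# Hypothetical location; hand `env` to subprocess.run(..., env=env) when starting Resistify.
env = model_cache_env("/scratch/resistify_models")
print(env["HF_HOME"], env["TORCH_HOME"])
```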

Results

results.tsv (nlr)

| Sequence | Length | LRR_Length | Motifs | Domains | Classification | NBARC_motifs | MADA | MADAL | CJID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZAR1 | 852 | 307 | CNNNNNNNNNLLLLLLLLLL | mCNL | CNL | 9 | False | True | False |

The main column of interest is "Classification", where we can see that it has been identified as a canonical CNL. The "Motifs" column indicates the series of NLR-associated motifs identified across the sequence - this can be useful if an NLR has an undetermined or unexpected classification. The columns "MADA", "MADAL", and "CJID" correspond to common NLR sequence signatures. Here, it appears that ZAR1 has a MADA-like motif.
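To read a long "Motifs" string at a glance, it can help to collapse consecutive repeats of the same letter. The sketch below is only a reading aid with hypothetical function names; it is not how Resistify derives the "Classification" column, which also draws on the domain evidence.

```python
from itertools import groupby


def collapse_motif_string(motifs):
    """Collapse runs of identical motif letters, e.g. "CNNLLL" becomes "CNL"."""
    return "".join(letter for letter, _ in groupby(motifs))


def motif_runs(motifs):
    """Return (letter, run length) pairs in the order they appear."""
    return [(letter, len(list(run))) for letter, run in groupby(motifs)]


print(collapse_motif_string("CNNNNNNNNNLLLLLLLLLL"))  # the ZAR1 string from the table above
print(motif_runs("CNNLLCLL"))  # made-up string with a stray C inside the LRR run
```

A stray single-letter run in the middle of another domain, as in the second call, is the kind of unexpected motif placement that is worth checking against motifs.tsv.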

results.tsv (prr)

| Sequence | Length | Extracellular_Length | LRR_Length | Type | Classification | Signal_peptide |
| --- | --- | --- | --- | --- | --- | --- |
| fls2 | 1173 | 806 | 675 | RLK | LRR | True |

For PRRs, sequences can be of the type RLP or RLK - both are single pass transmembrane proteins, and RLKs have an internal kinase domain. Classification refers to the domains identified in the external region. If multiple domains are identified, they will each be reported as a semi-colon separated list. If a signal peptide is identified in the sequence, this is reported accordingly.
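Because "Classification" may hold several domains separated by semi-colons, it is worth splitting it before counting or filtering. A minimal sketch using an inline copy of the example row; the column spellings are taken from the example table and may differ in real output, and the `LRR;LysM` value is a made-up illustration.

```python
import csv
import io

# Inline copy of the example row above (column spellings are an assumption).
PRR_RESULTS = (
    "Sequence\tLength\tExtracellular_Length\tLRR_Length\tType\tClassification\tSignal_peptide\n"
    "fls2\t1173\t806\t675\tRLK\tLRR\tTrue\n"
)


def extracellular_domains(row):
    """Split the semi-colon separated Classification field into a list."""
    return [domain for domain in row["Classification"].split(";") if domain]


rows = list(csv.DictReader(io.StringIO(PRR_RESULTS), delimiter="\t"))
for row in rows:
    print(row["Sequence"], row["Type"], extracellular_domains(row))

# Made-up multi-domain value to show the split
print(extracellular_domains({"Classification": "LRR;LysM"}))
```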

motifs.tsv

| Sequence | Motif | Position | Probability | Downstream_sequence | Motif_sequence | Upstream_sequence |
| --- | --- | --- | --- | --- | --- | --- |
| ZAR1 | extEDVID | 65 | 0.9974 | LVADL | RELVYEAEDILV | DCQLA |
| ZAR1 | VG | 159 | 0.9924 | YDHTQ | VVGLE | GDKRK |
| ZAR1 | P-loop | 188 | 1.0 | IMAFV | GMGGLGKTT | IAQEV |
| ZAR1 | RNSB-A | 211 | 0.9981 | EIEHR | FERRIWVSVS | QTFTE |
| ZAR1 | Walker-B | 259 | 0.973 | QYLLG | KRYLIVMD | DVWDK |
| ZAR1 | RNSB-B | 290 | 0.9846 | RGQGG | SVIVTTR | SESVA |
| ZAR1 | RNSB-C | 317 | 0.9994 | HRPEL | LSPDNSWLLF | CNVAF |
| ZAR1 | RNSB-D | 417 | 0.9875 | SHLKS | CILTLSLYP | EDCVI |
| ZAR1 | GLPL | 356 | 0.9998 | VTKCK | GLPLT | IKAVG |
| ZAR1 | MHD | 486 | 0.9965 | IITCK | IHD | MVRDL |
| ZAR1 | LxxLxL | 511 | 0.9398 | PEGLN | CRHLGI | SGNFD |
| ZAR1 | LxxLxL | 560 | 0.9973 | TDCKY | LRVLDI | SKSIF |
| ZAR1 | LxxLxL | 587 | 0.9993 | ASLQH | LACLSL | SNTHP |
| ZAR1 | LxxLxL | 611 | 0.9995 | EDLHN | LQILDA | SYCQN |
| ZAR1 | LxxLxL | 635 | 0.999 | VLFKK | LLVLDM | TNCGS |
| ZAR1 | LxxLxL | 685 | 0.9987 | KNLTN | LRKLGL | SLTRG |
| ZAR1 | LxxLxL | 712 | 0.9723 | INLSK | LMSISI | NCYDS |
| ZAR1 | LxxLxL | 740 | 0.9995 | TPPHQ | LHELSL | QFYPG |
| ZAR1 | LxxLxL | 765 | 0.9976 | HKLPM | LRYMSI | CSGNL |
| ZAR1 | LxxLxL | 817 | 0.9391 | QSMPY | LRTVTA | NWCPE |

Here, the positions, probabilities, and sequence of NLRexpress motif hits are listed. The five amino acids upstream and downstream of the motif site are also provided. In PRR mode, only LRR motifs will be reported.
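One simple use of the "Probability" column is to flag the weaker motif hits for manual inspection. The sketch below works on an excerpt of the table above, trimmed to four columns; the 0.95 cut-off is an arbitrary illustration, not a threshold used by Resistify or NLRexpress.

```python
import csv
import io

# Excerpt of the ZAR1 rows above, trimmed to the columns used here.
MOTIFS_TSV = (
    "Sequence\tMotif\tPosition\tProbability\n"
    "ZAR1\tP-loop\t188\t1.0\n"
    "ZAR1\tLxxLxL\t511\t0.9398\n"
    "ZAR1\tLxxLxL\t560\t0.9973\n"
    "ZAR1\tLxxLxL\t817\t0.9391\n"
)


def low_confidence_hits(tsv_text, threshold=0.95):
    """Return (sequence, motif, position, probability) for hits below threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["Sequence"], row["Motif"], int(row["Position"]), float(row["Probability"]))
        for row in reader
        if float(row["Probability"]) < threshold
    ]


for hit in low_confidence_hits(MOTIFS_TSV):
    print(hit)
```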

domains.tsv

| Sequence | Domain | Start | End |
| --- | --- | --- | --- |
| ZAR1 | MADA | 0 | 21 |
| ZAR1 | CC | 4 | 129 |
| ZAR1 | NB-ARC | 162 | 410 |
| ZAR1 | LRR | 511 | 817 |

This file contains the coordinates of the domains identified by Resistify.
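Domain lengths can be derived from these coordinates. The sketch below assumes that both Start and End are inclusive; that convention is an assumption, although for the ZAR1 example it gives an LRR length of 817 - 511 + 1 = 307, which agrees with the LRR length reported for ZAR1 in the results.tsv example.

```python
# The ZAR1 rows from the table above as (sequence, domain, start, end)
DOMAINS = [
    ("ZAR1", "MADA", 0, 21),
    ("ZAR1", "CC", 4, 129),
    ("ZAR1", "NB-ARC", 162, 410),
    ("ZAR1", "LRR", 511, 817),
]


def domain_lengths(domains):
    """Return (sequence, domain, length), treating start and end as inclusive."""
    return [(seq, name, end - start + 1) for seq, name, start, end in domains]


for seq, name, length in domain_lengths(DOMAINS):
    print(seq, name, length)
```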

annotations.tsv

| Sequence | Domain | Start | End | E_value | Score | Source |
| --- | --- | --- | --- | --- | --- | --- |
| ZAR1 | MADA | 0 | 21 | 1.5e-06 | 16.2 | HMM |
| ZAR1 | CC | 4 | 128 | 2.3e-23 | 70.0 | HMM |
| ZAR1 | CC | 27 | 48 | NA | NA | Coconat |
| ZAR1 | CC | 60 | 75 | NA | NA | Coconat |
| ZAR1 | CC | 113 | 129 | NA | NA | Coconat |
| ZAR1 | NB-ARC | 162 | 410 | 1.4e-89 | 287.2 | HMM |
| ZAR1 | LRR | 511 | 817 | NA | NA | NLRexpress |

This file contains the raw annotations for each sequence, and the method which was used to identify them.
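Several raw annotations can cover the same domain, as with the four CC hits above. One plausible way to consolidate them is to merge overlapping intervals; this is a generic technique shown for illustration, not a confirmed description of Resistify's internals, although merging the ZAR1 CC hits does reproduce the single CC span (4 to 129) listed in the domains.tsv example.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals and return them sorted by start."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval, so extend it if this one reaches further
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# CC annotations for ZAR1 from the table above (one HMM hit, three Coconat hits)
cc_hits = [(4, 128), (27, 48), (60, 75), (113, 129)]
print(merge_intervals(cc_hits))  # [(4, 129)]
```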

Result visualisation

Domain visualisation

Often, it can be quite useful to visualise the domain structure of an NLR/PRR. For this purpose, I have added a new submodule called resistify draw which lets you quickly draw the results of a completed run. Simply point it at a completed results directory, and it will produce a plot for all sequences:

resistify draw nlr_results/

There are a couple of customisation options, such as --height and --width to change the plot dimensions, --query which allows you to select a single or multiple sequences, and --hide-motifs which lets you hide the motif markup.

Phylogenetics

Resistify extracts the NB-ARC domains of each hit so we can easily build a phylogenetic tree. Here, we create a tree rooted on the NB-ARC domain of CED-4. The mafft | fasttree method is used here for brevity rather than accuracy.

```{bash}
# Append the CED-4 NB-ARC domain to act as the outgroup
echo -e ">ced4\nREYHVDRVIKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDILLMLARVVSDTDDSHSITDFINRVLSRSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEISNAASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEKMAQLNNKLESRGLVGVECITPYSYKSLAMALQRCVEVLSDEDRSALAFAVVMPPGVDIPVKLWSCVIPVD" >> output/nbarc.fasta

# Align the NB-ARC domains, then build a quick approximate tree
mafft output/nbarc.fasta | fasttree > output/nbarc.tree
```

We can now plot the tree:

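```{R}
library(tidyverse)
library(ggtree)

# Read the tree and root it on the CED-4 outgroup
tree <- read.tree("output/nbarc.tree")
tree <- treeio::root(tree, outgroup = "ced4")

# Suffix the sequence names so that they match the tip labels of the tree
results <- read_tsv("output/results.tsv") |>
  mutate(Sequence = paste0(Sequence, "_1"))

# Attach the results table to the tree
myplot <- ggtree(tree, layout = "circular") %<+% results

# Colour each tip by its classification
myplot <- myplot + geom_tippoint(aes(colour = Classification))
```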

Example plot of phylogenetic tree

Frequently asked questions

Q: Can Resistify be used to predict resistance genes from genomic data?

A: Unfortunately, Resistify cannot be directly applied to a genome to predict resistance genes, unlike tools such as NLR-Annotator. If gene annotations are unavailable for your genome, my advice would be to use a tool like Helixer or ANNEVO to perform ab initio gene prediction first, then pass these to Resistify. Currently, I find that Helixer tends to identify more NLRs than ANNEVO (in Solanum):

A barplot of the number of NLRs identified by Helixer vs ANNEVO

Q: According to the Motif string, some of my genes have NLR motifs in unexpected places - are these significant?

A: False positives do occur for the motif predictions, and unexpected predictions such as a single CC motif in the LRR domain are unlikely to be representative of a true domain annotation. You can find a figure of the prediction accuracy rates for each predictor here. False positives shouldn't interfere with the classification accuracy.

Q: The NLRexpress step is quite slow - how can I speed it up?

A: More threads! The process is relatively fast on a non-NLR sequence, but can be quite slow when applied to an NLR. Resistify will automatically use as many threads as possible - I've used up to 128 threads and it scales fairly well. It's primarily due to the underlying jackhmmer process, which is slow when applied to NLRs, but not non-NLRs. As a result the --retain option doesn't have as much of a performance impact as you might expect.

Benchmarks

The following are some quick benchmarks of the various resistify pipelines against the DM potato genome annotation, which contains 44,851 protein sequences.

| Pipeline | Resources | CPU time | Real time | MaxRSS |
| --- | --- | --- | --- | --- |
| nlr | 32T AMD EPYC 7543 | 05:14:47 | 00:12:42 | 15.0G |
| nlr --retain | 32T AMD EPYC 7543 | 1-01:51:50 | 01:07:42 | 13.1G |
| nlr --coconat | 32T AMD EPYC 7543 | 12:12:02 | 00:26:08 | 14.9G |
| prr | 16T AMD EPYC 7543, NVIDIA A100 80GB | 23:14:05 | 00:59:40 | 8.4G |
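Dividing CPU time by real time gives a rough idea of how many threads were busy on average. The sketch below assumes the times follow a `[D-]HH:MM:SS` layout, with an optional day count before the dash; the function names are hypothetical.

```python
def parse_elapsed(text):
    """Convert a "[D-]HH:MM:SS" string into a number of seconds."""
    days, _, clock = text.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return (int(days) if days else 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def average_busy_threads(cpu_time, real_time):
    """Ratio of CPU time to wall-clock time."""
    return parse_elapsed(cpu_time) / parse_elapsed(real_time)


# The `nlr` row from the table above: roughly 24.8 of the 32 threads busy on average
print(round(average_busy_threads("05:14:47", "00:12:42"), 1))
```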

Contributing

Contributions are greatly appreciated! If you experience any issues running Resistify, please get in touch via the Issues page. If you have any suggestions for additional features, get in touch!

Citing

Smith M., Jones J. T., Hein I. (2025) Resistify: A Novel NLR Classifier That Reveals Helitron-Associated NLR Expansion in Solanaceae. Bioinformatics and Biology Insights. 2025;19. doi:10.1177/11779322241308944

You must also cite:

Martin, E. C., Spiridon, L., Goverse, A., & Petrescu, A. J. (2022). NLRexpress - A bundle of machine learning motif predictors - Reveals motif stability underlying plant Nod-like receptors diversity. Frontiers in Plant Science, 13, 975888. https://doi.org/10.3389/fpls.2022.975888

If you use the CoCoNat module, please cite:

Madeo, G., Savojardo, C., Manfredi, M., Martelli, P. L., & Casadio, R. (2023). CoCoNat: a novel method based on deep learning for coiled-coil prediction. Bioinformatics, 39(8), btad495. https://doi.org/10.1093/bioinformatics/btad495

If you use the PRR module, please cite:

Bernhofer, M., & Rost, B. (2022). TMbed: transmembrane proteins predicted through language model embeddings. BMC bioinformatics, 23(1), 326. https://doi.org/10.1186/s12859-022-04873-x

Hall of fame

If you've used Resistify in your research, feel free to add it here!

Du, H., He, Y., Chen, M., Zheng, X., Gui, D., Tang, J., Fang, Y., Huang, Y., Wan, H., Ruan, J. and Jin, X., 2025. A near-complete genome assembly of Fragaria iinumae. BMC genomics, 26, p.253.

Liu, Z., Wang, X., Cao, S., Lei, T., Chenzhu, Y., Zhang, M., Liu, Z., Lu, J., Ma, W., Su, B. and Wang, Y., 2024. Deep learning facilitates precise identification of disease-resistance genes in plants. bioRxiv, pp.2024-09.

Owner

Name: Moray Smith
Login: SwiftSeal
Kind: user
Location: Dundee, Scotland

Website: swiftseal.github.io
Twitter: moray_smith
Repositories: 3
Profile: https://github.com/SwiftSeal

PhD student at the James Hutton Institute

GitHub Events

Total

Create event: 36
Issues event: 44
Release event: 15
Watch event: 7
Delete event: 23
Issue comment event: 73
Push event: 193
Pull request event: 51
Fork event: 1

Last Year

Create event: 36
Issues event: 44
Release event: 15
Watch event: 7
Delete event: 23
Issue comment event: 73
Push event: 193
Pull request event: 51
Fork event: 1

Issues and Pull Requests

All Time

Total issues: 14
Total pull requests: 18
Average time to close issues: 8 days
Average time to close pull requests: 1 day
Total issue authors: 8
Total pull request authors: 2
Average comments per issue: 2.07
Average comments per pull request: 0.11
Merged pull requests: 13
Bot issues: 0
Bot pull requests: 7

Past Year

Issues: 14
Pull requests: 18
Average time to close issues: 8 days
Average time to close pull requests: 1 day
Issue authors: 8
Pull request authors: 2
Average comments per issue: 2.07
Average comments per pull request: 0.11
Merged pull requests: 13
Bot issues: 0
Bot pull requests: 7

Top Authors

Issue Authors

SwiftSeal (15)
hysong0921 (2)
panxinfeng661 (2)
zhangwenda0518 (1)
yilunhuangyue (1)
lixiang117423 (1)
YFKIB (1)
enriquepola1996 (1)
ttw-ymy (1)
slbai01 (1)
RiErm7 (1)
colindaven (1)
caiyinbi-2 (1)

Pull Request Authors

SwiftSeal (25)
pre-commit-ci[bot] (7)
colindaven (1)

Top Labels

Issue Labels

bug (10) enhancement (3) documentation (1) question (1)

Pull Request Labels

enhancement (1)

Science Score: 49.0%