2024-tick-sg-peptides-tsa

Analysis of tick salivary gland transcriptomes with peptigate for peptide discovery

https://github.com/arcadia-science/2024-tick-sg-peptides-tsa

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.4%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: Arcadia-Science
  • License: agpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 60.1 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 3
  • Releases: 1
Created about 2 years ago · Last pushed about 1 year ago

README.md

2024-tick-sg-peptides-tsa: Predicting peptide sequences from tick salivary gland transcriptomes in the Transcriptome Shotgun Assembly database

Purpose

This repository documents peptide discovery in tick salivary gland transcriptomes on the TSA. This repository is associated with the pub, "Predicting peptides from tick salivary glands that suppress host detection."

Installation, Setup, and Running the Pipeline

This repository uses Snakemake to run the pipeline and conda to manage software environments and installations. You can find operating system-specific instructions for installing miniconda here. After installing conda and mamba, run the following command to create the pipeline run environment.

    mamba env create -n ticktides --file envs/dev.yml
    conda activate ticktides

Snakemake manages rule-specific environments via the conda directive and using environment files in the envs/ directory. Snakemake itself is installed in the main development conda environment as specified in the dev.yml file.

The pipeline progresses in three stages. In the first, tick transcriptomes are prepped for the peptigate pipeline using the prep-txomes-for-peptigate.snakefile. It uses the metadata file tick_sg_transcriptomes_tsa.csv to download tick transcriptomes from the TSA and predict protein sequences with TransDecoder. It also produces config files for peptigate. This snakefile can be run with:

    snakemake -s prep-txomes-for-peptigate.snakefile --software-deployment-method conda -j 8

Then, supply this processed data and config files to run peptigate. The peptigate pipeline predicts peptides (sORF and cleavage) from transcriptome assemblies. We run this as a separate step because peptigate is its own Snakemake workflow in a GitHub repository and is not currently installable. We ran peptigate from the commit 148823239aad41a8f03da37f5499b00c8a79de40. We cloned the repository, copied our config and input data files to the repo folder, and ran peptigate with this for loop:

    for infile in input_configs/tsa_tick_sg_transcriptomes/*yml
    do
        snakemake --software-deployment-method conda -j 1 -k --configfile $infile
    done
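The same loop can be sketched in Python, for example to log or dry-run the commands before launching them. This is a minimal sketch and not part of the repository: the helper name `build_peptigate_commands` is hypothetical, and the flags are copied from the shell loop.

```python
from pathlib import Path


def build_peptigate_commands(config_dir, jobs=1):
    """Return one snakemake command (as an argument list) per YAML config file.

    Hypothetical helper for illustration; the flags mirror the shell loop.
    """
    commands = []
    # sorted() gives a stable order; the glob uses the same "*yml" pattern.
    for config in sorted(Path(config_dir).glob("*yml")):
        commands.append(
            [
                "snakemake",
                "--software-deployment-method", "conda",
                "-j", str(jobs),
                "-k",
                "--configfile", str(config),
            ]
        )
    return commands


if __name__ == "__main__":
    for cmd in build_peptigate_commands("input_configs/tsa_tick_sg_transcriptomes"):
        # Print only; hand cmd to subprocess.run(cmd) to actually execute it.
        print(" ".join(cmd))
```

Building argument lists rather than shell strings avoids quoting problems if a config path ever contains spaces.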

We then transferred the results back to this repo into the outputs/tsa_tick_sg_transcriptomes folder and analyzed them with the snakefile analyze-peptigate-outputs.snakefile.

    snakemake -s analyze-peptigate-outputs.snakefile --software-deployment-method conda -j 8 -k

Finally, we analyzed the results from this last snakefile using the notebooks in the notebooks directory. To run these notebooks, we use the tidyjupyter environment:

    mamba env create -n tidyjupyter --file envs/tidyjupyter.yml
    conda activate tidyjupyter

Data

The data analyzed in this repository are recorded in the CSV file tick_sg_transcriptomes_tsa.csv. We searched the NCBI Transcriptome Shotgun Assembly sequence database for salivary gland transcriptomes from tick species. The CSV file records the transcriptome accession numbers as well as metadata about the size of each transcriptome assembly.

Overview

This repository records peptide discovery and analysis from publicly available tick salivary gland transcriptomes. As documented above, the analysis proceeds in three parts, beginning with data acquisition, progressing to peptide prediction, and finishing with peptide analysis.

Description of the folder structure

  • envs/: Contains the conda environment YAML files that Snakemake uses for individual rules, as well as the environments used to run the Snakemake pipelines and the notebooks.
  • inputs/: Contains input files for the analysis as well as models for tools used in the repository.
  • notebooks/: Contains jupyter notebooks that analyze the predicted peptides.
  • scripts/: Scripts executed by the snakemake pipelines.
  • LICENSE: Details re-use constraints.
  • README.md: Documents the project and provides run instructions.
  • analyze-peptigate-outputs.snakefile: Documents the steps taken to analyze (annotate and compare) the peptide sequences predicted by peptigate.
  • prep-txomes-for-peptigate.snakefile: Documents the steps taken to prepare the TSA transcriptomes for peptigate.
  • .github/, .vscode/, .gitignore, .pre-commit-config.yaml, Makefile, pyproject.toml: Snakemake template files that control the developer environment of the repository. See the Arcadia-Science/snakemake-template for more details.

Compute Specifications

  • prep-txomes-for-peptigate.snakefile: Ran on a 2021 MacBook Pro with 64 GB of RAM running macOS Ventura 13.4. We executed all commands in a terminal running Rosetta.
  • peptigate pipeline: Ran on an AWS EC2 instance type g4dn.2xlarge running AMI Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240122 (AMI ID ami-07eb000b3340966b0). Note the pipeline runs many tools that use GPUs.
  • analyze-peptigate-outputs.snakefile: Ran on an AWS EC2 instance type g4dn.2xlarge running AMI Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240122 (AMI ID ami-07eb000b3340966b0). Note the tool AutoPeptideML runs on a GPU.

Notes about prep-txomes-for-peptigate.snakefile

The TSA is notoriously unreliable for downloading files. While in theory one should be able to download transcriptome assemblies and predicted proteins directly from the TSA using NCBI's Entrez-Direct tool, in practice this approach is spotty due to outages. When using this approach, we received error messages like the following for a subset of transcriptomes:

    $ esearch -db nuccore -query GKHV01 | efetch -format fasta > GKHV01_fasta.fa
    WARNING: FAILURE ( Wed Mar 27 18:04:01 UTC 2024 )
    nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -query_key 1 -WebEnv MCID_66045f91e16b2b7f5e329221 -retstart 0 -retmax 1 -db nuccore -rettype fasta -retmode text -tool edirect -edirect 21.6 -edirect_os Linux -email ubuntu@ip-172-31-6-147.us-west-1.compute.internal
    EMPTY RESULT
    SECOND ATTEMPT
    WARNING: FAILURE ( Wed Mar 27 18:04:04 UTC 2024 )
    nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -query_key 1 -WebEnv MCID_66045f91e16b2b7f5e329221 -retstart 0 -retmax 1 -db nuccore -rettype fasta -retmode text -tool edirect -edirect 21.6 -edirect_os Linux -email ubuntu@ip-172-31-6-147.us-west-1.compute.internal
    EMPTY RESULT
    LAST ATTEMPT
    ERROR: FAILURE ( Wed Mar 27 18:04:06 UTC 2024 )
    nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -query_key 1 -WebEnv MCID_66045f91e16b2b7f5e329221 -retstart 0 -retmax 1 -db nuccore -rettype fasta -retmode text -tool edirect -edirect 21.6 -edirect_os Linux -email ubuntu@ip-172-31-6-147.us-west-1.compute.internal
    EMPTY RESULT
    QUERY FAILURE

We received this error:

  • With and without using an NCBI API
  • With the entrez-direct tool installed via NCBI's installation method or via conda
  • On a local (MacOS) computer and on an AWS EC2 instance.

As such, we provide a backup url from which to download the transcriptome assemblies. This two-step download approach is baked into the Snakefile. Further, while one can programmatically download protein predictions in amino acid or nucleotide format if they are available using entrez-direct, the TSA only provides links for the contigs in nucleotide format or the proteins in amino acid format. Given this, and how spotty entrez-direct coverage of the TSA is, we chose to predict proteins using TransDecoder for all transcriptomes, whether they have annotations available or not. These decisions are documented in docstrings in the Snakefile and executed with the code itself.
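The two-step download idea can be illustrated with a short, generic Python sketch. This is not the code in the Snakefile: the function name `download_with_fallback` and its `fetch` parameter are assumptions made for illustration, and the sources are tried in whatever order the caller supplies. An empty response is treated as a failure, matching the "EMPTY RESULT" case in the log above.

```python
from urllib.request import urlopen


def download_with_fallback(urls, fetch=None):
    """Try each URL in order and return (url, data) for the first that works.

    `fetch` is any callable that takes a URL and returns bytes; it defaults
    to a plain HTTP GET. Hypothetical helper, not the Snakefile's own logic.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=60) as response:
                return response.read()

    errors = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            errors.append(f"{url}: {exc}")
            continue
        if data:
            return url, data
        # A successful request with no content is still a failed download.
        errors.append(f"{url}: empty result")
    raise RuntimeError("all download sources failed:\n" + "\n".join(errors))
```

Collecting the per-source errors before raising keeps the failure message useful when both the primary and the backup source are down.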

Overview of results

The results covered here are documented in greater detail in the analysis notebooks in the notebooks folder.

Tick salivary gland transcriptomes contain thousands of predicted peptides, many of which have predicted anti-inflammatory or anti-pruritic bioactivity. We predicted peptide sequences from 29 publicly available tick salivary gland transcriptomes as well as the A. americanum (whole body, midgut, and salivary gland) transcriptome assembled in a previous pilot. In total, peptigate predicted 226,538 peptides (17,928 cleavage, 208,610 sORF) from 19 tick species from the genera Amblyomma, Hyalomma, Ixodes, Ornithodoros, and Rhipicephalus.

Peptides with predicted anti-inflammatory bioactivity

See this notebook for more information. We predicted that 5,142 distinct peptide sequences (2,320 cleavage, 3,822 sORF) had anti-inflammatory bioactivity (see this issue for how we predicted anti-inflammatory bioactivity). The machine learning model we used had a 71% accuracy rate. Given this low accuracy rate and the high number of predictions, we struggled to pare this list down to the peptides worth validating experimentally using these data alone.

Peptides that are similar to known peptides with anti-pruritic activity

See this notebook for more information.

There are very few peptides with evidence of anti-pruritic effects, so we could not create a machine learning model to identify this bioactivity. Instead, we BLASTp'd our peptide predictions against a database of protein sequences for four peptides with evidence of anti-pruritic activity: calcitonin gene-related peptide, dynorphin, tachykinin-4, and ziconotide. We also BLASTp'd against votucalis, a small tick protein that sequesters histamine; votucalis is not a peptide, as it is longer than 100 amino acids.
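As a rough illustration of how such BLASTp results can be filtered, the sketch below keeps the best hit per query peptide. It rests on several assumptions that the text does not confirm: that the search was written in BLAST+ tabular format (`-outfmt 6`, default 12 columns), that hits are ranked by e-value, and that `1e-5` is a sensible cutoff. The function name `best_hits` and the example rows are invented for illustration and are not results from this repository.

```python
import csv
import io

# Default column order for BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5):
    """Return {query_id: row}, keeping the lowest e-value hit per query.

    `max_evalue` is an illustrative cutoff, not the one used in the notebooks.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        if evalue > max_evalue:
            continue  # too weak to count as a hit
        current = best.get(row["qseqid"])
        if current is None or evalue < float(current["evalue"]):
            best[row["qseqid"]] = row
    return best


# Made-up rows: two hits for pep1, one weak hit for pep2.
example = io.StringIO(
    "pep1\tCGRP\t62.5\t32\t12\t0\t1\t32\t5\t36\t1e-08\t48.1\n"
    "pep1\tdynorphin\t40.0\t15\t9\t0\t3\t17\t1\t15\t2e-06\t30.0\n"
    "pep2\tziconotide\t35.0\t20\t13\t0\t1\t20\t1\t20\t0.5\t18.0\n"
)
hits = best_hits(example)
print({query: row["sseqid"] for query, row in hits.items()})
```

With these toy rows, only pep1 passes the cutoff, and its retained hit is the one with the lower e-value.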

We identified 106 peptides (2 cleavage, 104 sORF) from 16 species that had hits to anti-pruritic peptides, the majority of which matched calcitonin gene-related peptide. About 70% of these sequences had hits against the Human Peptide Atlas, indicating that they might have homology (and shared function) with human peptides. (We assume this fraction is so high because we used BLAST to detect the sequences of interest in the first place.) We again clustered all predicted peptides at 80% sequence identity and joined this information to our anti-pruritic peptide predictions; in total, the 106 peptides belonged to 92 clusters, suggesting that we recovered largely independent sequences.
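The cluster-joining step can be sketched as follows. This is a minimal illustration with toy identifiers, not the notebook code: the function name `count_clusters` and the representation of cluster membership as a peptide-to-representative mapping are assumptions.

```python
def count_clusters(hit_ids, cluster_of):
    """Count how many distinct clusters a set of peptide IDs falls into.

    `cluster_of` maps peptide ID -> cluster representative ID. Peptides
    missing from the mapping are treated as singletons (their own cluster).
    """
    clusters = {cluster_of.get(peptide, peptide) for peptide in hit_ids}
    return len(clusters)


# Toy data: pepA and pepB share a cluster; pepD is absent from the mapping.
cluster_of = {"pepA": "pepA", "pepB": "pepA", "pepC": "pepC"}
hit_ids = ["pepA", "pepB", "pepC", "pepD"]
n_clusters = count_clusters(hit_ids, cluster_of)
print(f"{len(hit_ids)} peptides in {n_clusters} clusters")
```

A cluster count close to the peptide count, as with the 106 peptides in 92 clusters reported above, means few of the hits are near-duplicates of one another.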

Contributing

See how we recognize feedback and contributions to our code.

Owner

  • Name: Arcadia Science
  • Login: Arcadia-Science
  • Kind: organization
  • Location: United States of America

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite the associated publication.
title: Predicting peptides from tick salivary glands that suppress host detection
doi: 10.57844/arcadia-gfhy-d2f3
authors:
- family-names: Bell
  given-names: Audrey
  affiliation: Arcadia Science
  orcid: https://orcid.org/0009-0008-2270-1613
- family-names: Borges
  given-names: Adair L.
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-1477-7908
- family-names: Cheveralls
  given-names: Keith
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-4157-6087
- family-names: Chou
  given-names: Seemay
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-7271-303X
- family-names: Donnelly
  given-names: Justin
  affiliation: Arcadia Science
  orcid: https://orcid.org/0009-0000-1480-7372
- family-names: Hochstrasser
  given-names: Megan L.
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-4404-078X
- family-names: McDaniel
  given-names: Elizabeth A.
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0003-4692-0913
- family-names: Reiter
  given-names: Taylor
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-7388-421X
- family-names: Weiss
  given-names: Emily C.P.
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-9960-0270
preferred-citation:
  title: Predicting peptides from tick salivary glands that suppress host detection
  type: article
  doi: 10.57844/arcadia-gfhy-d2f3
  authors:
  - family-names: Borges
    given-names: Adair L.
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-1477-7908
  - family-names: Chou
    given-names: Seemay
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-7271-303X
  - family-names: Reiter
    given-names: Taylor
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-7388-421X
  - family-names: Weiss
    given-names: Emily C.P.
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-9960-0270
  year: 2025

GitHub Events

Total
  • Release event: 1
  • Delete event: 1
  • Push event: 2
  • Public event: 1
  • Pull request event: 1
  • Pull request review event: 1
  • Create event: 2
Last Year
  • Release event: 1
  • Delete event: 1
  • Push event: 2
  • Public event: 1
  • Pull request event: 1
  • Pull request review event: 1
  • Create event: 2

Dependencies

.github/workflows/lint.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/run-demo-pipeline.yml actions
  • actions/checkout v3 composite
  • snakemake/snakemake-github-action v1.24.0 composite
pyproject.toml pypi