2024-chelicerate-phylogenomics-peptides

Analyzes the outputs of the chelicerate noveltree run and subsequent trait-mapping for association with host detection suppression (phylo-profiling) and predicts peptides from those proteins

https://github.com/arcadia-science/2024-chelicerate-phylogenomics-peptides

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: Arcadia-Science
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 102 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 1
  • Releases: 1
Created almost 2 years ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

Predicting peptides from chelicerate proteins associated with host detection suppression

Purpose

This repository predicts peptides from chelicerate species that are associated with host detection suppression. In this context, we refer to suppression of the triad of inflammation, pain, and itch as "host detection suppression." This repository is associated with the pub, "Predicting peptides from tick salivary glands that suppress host detection."

Installation and Setup

This repository uses Snakemake to run the pipeline and conda to manage software environments and installations. You can find operating system-specific instructions for installing miniconda here. After installing conda and mamba, run the following command to create the pipeline run environment.

mamba env create -n ticktides --file envs/dev.yml
conda activate ticktides

Snakemake manages rule-specific environments via the conda directive and using environment files in the envs/ directory. Snakemake itself is installed in the main development conda environment as specified in the dev.yml file.

To start the pipeline, run:

snakemake -s analyze-peptigate-outputs.snakefile --software-deployment-method conda -j 8

Data

This repository analyzes the outputs of four previous analyses to predict peptide sequences from proteins that are associated with host detection suppression. These analyses were previous pilots or work done by others at Arcadia. The four upstream analyses are:

1. protein-data-curation: This repository provides a workflow to download, process, and annotate genomic and transcriptomic data that can then be fed to NovelTree (see next point). The input sequences are all from chelicerate species and are documented here. This repository also annotated these sequences, and that metadata is now incorporated into the file 2024-06-26-top-positive-significant-clusters-orthogroups-annotations.tsv.gz.
2. Chelicerate NovelTree run: NovelTree applied to the curated chelicerate dataset. The input dataset to NovelTree is protein sequences (either genome gene annotations or open reading frames predicted from transcriptomes). NovelTree applies evolutionary analyses to identify proteins that are "novel" under models of speciation, loss, or transfer.
3. Chelicerate trait mapping: trait mapping applied to the results of the chelicerate NovelTree run to identify proteins (from clusters of orthologous groups) that are associated with the host detection-suppression trait. The protein sequences associated with host detection suppression are recorded in the file 2024-06-26-top-positive-significant-clusters-orthogroups-proteins.fasta.gz.
4. Tick salivary gland transcriptome peptides: peptide sequences predicted from publicly available tick salivary gland transcriptomes. The results from this analysis are in the file tsasgpeptides.faa.gz.

Overview

Using the sequences of proteins predicted to be associated with the host detection-suppression trait, we predicted peptide sequences from these proteins. We define peptides as any protein sequence of less than or equal to 100 amino acids in length.

The peptigate pipeline predicts peptides from transcriptome assemblies. It has two prediction modules that target small open reading frames (sORFs) and cleavage peptides, respectively. Because we had access to protein sequences but not to transcriptome assemblies, we modified our peptide prediction approach to retrieve as many peptides as possible given our input data. We took three steps:

1. sORF prediction: We filtered the host detection-suppression-associated proteins to those that were 100 amino acids or less. This is where we likely lost the most predictive capability: we think we'd have many more potential sORF sequences if we had transcriptomes as input instead of protein sequences. However, using transcriptomes would require a new trait mapping analysis.
2. Cleavage peptide prediction: We ran the cleavage peptide prediction portion of the peptigate pipeline on protein sequences longer than 100 amino acids.
3. Peptide annotation: We annotated the predicted peptides with the annotation and analysis portion of the peptigate pipeline.
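The length split described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used by peptigate: the function names `read_fasta` and `split_by_length` are hypothetical, and stripping a trailing `*` stop character before measuring length is an assumption about how the input proteins are formatted.

```python
import io

MAX_PEPTIDE_LENGTH = 100  # peptides are defined as <= 100 amino acids


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            name, chunks = line[1:].split(maxsplit=1)[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_length(records, max_len=MAX_PEPTIDE_LENGTH):
    """Split proteins into sORF candidates and cleavage-prediction inputs."""
    sorf_candidates, cleavage_inputs = {}, {}
    for name, seq in records:
        seq = seq.rstrip("*")  # assumption: ignore a trailing stop symbol
        if len(seq) <= max_len:
            sorf_candidates[name] = seq
        else:
            cleavage_inputs[name] = seq
    return sorf_candidates, cleavage_inputs


# Tiny in-memory example; real input would be the proteins FASTA file.
demo = io.StringIO(">short_protein\nMKTLLAAG\n>long_protein\n" + "A" * 150 + "\n")
sorfs, cleavage = split_by_length(read_fasta(demo))
print(len(sorfs), "sORF candidate(s);", len(cleavage), "cleavage input(s)")
```

For a gzipped input such as the proteins FASTA named above, the handle could be opened with `gzip.open(path, "rt")` from the standard library.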

We ran peptigate from commit 148823239aad41a8f03da37f5499b00c8a79de40 with the following command:

snakemake -s protein_as_input.snakefile --software-deployment-method conda -j 1 -k --configfile input_configs/tot_protein_peptigate_config.yml

where the tot_protein_peptigate_config.yml looked like:

input_dir: "inputs/"
output_dir: "outputs/ToT_20240626"
orfs_amino_acids: "input_data/ToT_20240626/2024-06-26-top-positive-significant-clusters-orthogroups-proteins.fasta.gz"

With these peptide predictions, we then proceeded with the rest of the analysis.

We ran additional annotation and comparison analysis with the snakefile in this repository:

snakemake -s analyze-peptigate-outputs.snakefile --software-deployment-method conda -j 8

This snakefile:

* Compares (BLASTp) peptide predictions against peptides in the Human Peptide Atlas. The idea is that if a peptide resembles a human peptide, it may have evolved to mimic a human process or to interact with human molecules.
* Compares (BLASTp) peptide predictions against peptides predicted in tick salivary glands. We want to target peptides in tick salivary glands because this is where we expect to see most molecules that have evolved to influence human biology.
* Compares (BLASTp) peptide predictions against peptides with known anti-pruritic effects. We only have 5 peptides (see the anti-pruritic peptides folder) that are known to suppress host detection, so this comparison set is small, but we compare against it anyway.
* Clusters (MMseqs2, 80% identity) peptide predictions to determine how similar different peptides are to each other. The 80% threshold is somewhat arbitrary, but it's commonly used in machine learning as a cutoff for shared information between proteins.
* Predicts anti-inflammatory bioactivity (AutoPeptideML) for peptide predictions. The model that does this has 70% accuracy. We expect there to be some overlap between inflammation suppression and host detection suppression, so we include this information.
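To illustrate how the BLASTp comparisons above could be summarized per peptide, here is a minimal Python sketch that keeps the highest-scoring hit for each query. It assumes the BLASTp results were written in the standard 12-column tabular format (`-outfmt 6`); the source does not state which output format the snakefile uses. The function name `best_hits` and the optional `max_evalue` parameter are hypothetical.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=None):
    """Return the highest-bitscore hit per query peptide.

    max_evalue is a hypothetical optional cutoff; None keeps every hit.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        bitscore = float(row["bitscore"])
        if max_evalue is not None and evalue > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            best[row["qseqid"]] = {
                "sseqid": row["sseqid"],
                "pident": float(row["pident"]),
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


# Made-up rows for illustration only; these are not results from this analysis.
demo = io.StringIO(
    "pep1\tsubjectA\t90.0\t20\t2\t0\t1\t20\t5\t24\t1e-5\t40.0\n"
    "pep1\tsubjectB\t100.0\t20\t0\t0\t1\t20\t1\t20\t1e-8\t55.5\n"
)
print(best_hits(demo)["pep1"]["sseqid"])
```

Short peptides often produce weak e-values, so any cutoff passed as `max_evalue` would need to be chosen with that in mind.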

Last, we ran the following notebooks to combine and filter the results to produce the most promising candidates for peptides that may suppress host detection. We ran the notebooks using the envs/tidyjupyter.yml environment.

Results

We started with 3,690 input protein sequences from 87 orthogroups. We initially predicted 741 peptides (712 distinct sequences) in 46 orthogroups. We applied the following initial filters (see this notebook):

* Filtered (removed) propeptides predicted by DeepPeptide. DeepPeptide uses the UniProt definition of a propeptide, a part of a protein that's cleaved during maturation or activation. Once cleaved, a propeptide generally has no independent biological function.
* Filtered (removed) peptides in orthogroups where no peptide had a hit to a predicted peptide from a tick salivary gland transcriptome. We want to target things expressed in the tick salivary gland because they're more likely to be biologically active in host detection suppression. However, we don’t know if the tick salivary gland transcriptomes we worked with are complete (many are heavily filtered), so we relaxed this filter to function at the orthogroup level.

These filters reduced the number of peptides to 314 peptides (311 distinct sequences) in 16 orthogroups.
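The two initial filters can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the record fields (`orthogroup`, `is_propeptide`, `has_tick_hit`) and the function name `apply_initial_filters` are hypothetical, and the sketch assumes the orthogroup-level hit check is evaluated after propeptides are removed, which the text does not specify.

```python
def apply_initial_filters(peptides):
    """Drop propeptides, then keep orthogroups with at least one tick hit.

    Each peptide is a dict with hypothetical keys:
      "orthogroup"    - orthogroup identifier
      "is_propeptide" - True if annotated as a propeptide
      "has_tick_hit"  - True if it hit a tick salivary gland peptide
    """
    # Filter 1: remove peptides annotated as propeptides.
    kept = [p for p in peptides if not p["is_propeptide"]]

    # Filter 2: applied per orthogroup, not per peptide. One hit is enough
    # to retain every remaining peptide in that orthogroup.
    orthogroups_with_hit = {p["orthogroup"] for p in kept if p["has_tick_hit"]}
    return [p for p in kept if p["orthogroup"] in orthogroups_with_hit]


# Made-up records for illustration only.
demo = [
    {"id": "p1", "orthogroup": "OG_A", "is_propeptide": False, "has_tick_hit": True},
    {"id": "p2", "orthogroup": "OG_A", "is_propeptide": False, "has_tick_hit": False},
    {"id": "p3", "orthogroup": "OG_B", "is_propeptide": False, "has_tick_hit": False},
]
print([p["id"] for p in apply_initial_filters(demo)])
```

Working at the orthogroup level means a peptide without its own hit (like `p2` here) is retained when a sibling in the same orthogroup has one, which matches the relaxed filter described above.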

Using these sequences, we then further filtered to select the sequences most likely to suppress host detection and easiest to work with for experimental validation (see this notebook):

* Filtered to orthogroups where the majority of proteins had a predicted peptide. We were most interested in orthogroups where the majority of predicted peptides were of the same type (sORF or cleavage), although we did not use this as a strict filtering criterion.
* Filtered to orthogroups where the majority of peptides (sORF) or parent proteins (cleavage) had signal peptides. We only kept peptides with signal peptides.

These filters gave us a set of 89 peptides from 3 orthogroups. We then selected peptides within each of these orthogroups to synthesize. We made selections based on ease of synthesis and solubility, whether a similar peptide was expressed in a tick salivary gland transcriptome, and similarity to other peptides in the group. We ended up with 12 peptides (5 from OG0008102, 3 from OG0001774, 4 from OG0000880).
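The majority-based orthogroup filters that produced the 89-peptide set can be sketched like this. It is a rough illustration, not the notebook's implementation: all field and function names are hypothetical, "majority" is interpreted as strictly more than 50%, and `has_signal_peptide` is assumed to be precomputed (on the peptide itself for sORFs, on the parent protein for cleavage peptides).

```python
from collections import defaultdict


def is_majority(count, total):
    """Assumption: 'majority' means strictly more than half."""
    return total > 0 and count / total > 0.5


def select_candidates(proteins_per_orthogroup, peptides):
    """Apply the two majority filters, keeping only signal-peptide peptides.

    proteins_per_orthogroup maps an orthogroup ID to its protein count.
    Each peptide is a dict with hypothetical keys "orthogroup",
    "parent_protein", and "has_signal_peptide".
    """
    by_orthogroup = defaultdict(list)
    for pep in peptides:
        by_orthogroup[pep["orthogroup"]].append(pep)

    selected = []
    for orthogroup, peps in by_orthogroup.items():
        # Filter 1: most proteins in the orthogroup yielded a peptide.
        parents = {p["parent_protein"] for p in peps}
        total_proteins = proteins_per_orthogroup.get(orthogroup, 0)
        if not is_majority(len(parents), total_proteins):
            continue
        # Filter 2: most peptides (or their parents) carry a signal peptide.
        with_signal = [p for p in peps if p["has_signal_peptide"]]
        if not is_majority(len(with_signal), len(peps)):
            continue
        # Only peptides with signal peptides are kept.
        selected.extend(with_signal)
    return selected
```

A call such as `select_candidates({"OG_X": 4}, peptide_records)` would return only the signal-peptide-bearing records from orthogroups that pass both checks; `OG_X` and `peptide_records` are placeholders.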

Compute Specifications

  • peptigate pipeline: Ran on an AWS EC2 instance type g4dn.2xlarge running AMI Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240122 (AMI ID ami-07eb000b3340966b0). Note the pipeline runs many tools that use GPUs.
  • analyze-peptigate-outputs.snakefile: Ran on an AWS EC2 instance type g4dn.2xlarge running AMI Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240122 (AMI ID ami-07eb000b3340966b0). One tool in the pipeline, AutoPeptideML, uses a GPU.
  • Notebooks: Ran on a MacBook Pro (2021, M1) using operating system Ventura 13.4 and with 64GB of RAM.

Contributing

See how we recognize feedback and contributions to our code.

Owner

  • Name: Arcadia Science
  • Login: Arcadia-Science
  • Kind: organization
  • Location: United States of America

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite the associated publication.
title: Predicting peptides from tick salivary glands that suppress host detection
doi: 10.57844/arcadia-gfhy-d2f3
authors:
- family-names: Bell
  given-names: Audrey
  affiliation: Arcadia Science
  orcid: https://orcid.org/0009-0008-2270-1613
- family-names: Borges
  given-names: Adair L.
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-1477-7908
- family-names: Cheveralls
  given-names: Keith
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-4157-6087
- family-names: Chou
  given-names: Seemay
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-7271-303X
- family-names: Donnelly
  given-names: Justin
  affiliation: Arcadia Science
  orcid: https://orcid.org/0009-0000-1480-7372
- family-names: Hochstrasser
  given-names: Megan L.
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-4404-078X
- family-names: McDaniel
  given-names: Elizabeth A.
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0003-4692-0913
- family-names: Reiter
  given-names: Taylor
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-7388-421X
- family-names: Weiss
  given-names: Emily C.P.
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-9960-0270
preferred-citation:
  title: Predicting peptides from tick salivary glands that suppress host detection
  type: article
  doi: 10.57844/arcadia-gfhy-d2f3
  authors:
  - family-names: Borges
    given-names: Adair L.
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-1477-7908
  - family-names: Chou
    given-names: Seemay
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-7271-303X
  - family-names: Reiter
    given-names: Taylor
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-7388-421X
  - family-names: Weiss
    given-names: Emily C.P.
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-9960-0270
  year: 2025

GitHub Events

Total
  • Release event: 1
  • Delete event: 1
  • Push event: 2
  • Public event: 1
  • Pull request event: 1
  • Pull request review event: 1
  • Create event: 1
Last Year
  • Release event: 1
  • Delete event: 1
  • Push event: 2
  • Public event: 1
  • Pull request event: 1
  • Pull request review event: 1
  • Create event: 1

Dependencies

.github/workflows/lint.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/run-demo-pipeline.yml actions
  • actions/checkout v3 composite
  • snakemake/snakemake-github-action v1.24.0 composite
pyproject.toml pypi