2024-tick-sg-peptides-tsa
Analysis of tick salivary gland transcriptomes with peptigate for peptide discovery
https://github.com/arcadia-science/2024-tick-sg-peptides-tsa
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to ncbi.nlm.nih.gov
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Arcadia-Science
- License: agpl-3.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 60.1 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 3
- Releases: 1
Metadata Files
README.md
2024-tick-sg-peptides-tsa: Predicting peptide sequences from tick salivary gland transcriptomes in the Transcriptome Shotgun Assembly database
Purpose
This repository documents peptide discovery in tick salivary gland transcriptomes on the TSA. This repository is associated with the pub, "Predicting peptides from tick salivary glands that suppress host detection."
Installation, Setup, and Running the Pipeline
This repository uses Snakemake to run the pipeline and conda to manage software environments and installations. You can find operating system-specific instructions for installing miniconda here. After installing conda and mamba, run the following command to create the pipeline run environment.
```bash
mamba env create -n ticktides --file envs/dev.yml
conda activate ticktides
```
Snakemake manages rule-specific environments via the conda directive, using the environment files in the envs/ directory. Snakemake itself is installed in the main development conda environment, as specified in the dev.yml file.
The pipeline progresses in three stages.
In the first, tick transcriptomes are prepped for the peptigate pipeline using the prep-txomes-for-peptigate.snakefile.
It uses the metadata file tick_sg_transcriptomes_tsa.csv to download tick transcriptomes from the TSA and predict protein sequences with TransDecoder.
It also produces config files for peptigate.
This snakefile can be run with:
```bash
snakemake -s prep-txomes-for-peptigate.snakefile --software-deployment-method conda -j 8
```
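The config-writing part of this stage can be pictured roughly as follows. This is a minimal sketch, not the Snakefile's actual code: the column name `tsa_accession`, the second accession `XXXX01`, and the YAML keys are hypothetical placeholders, and peptigate's real config schema should be taken from the peptigate repository.

```python
import csv
import io

# Stand-in for the metadata CSV. The column name "tsa_accession" and the
# accession "XXXX01" are placeholders, not values taken from the real file.
EXAMPLE_METADATA = """tsa_accession
GKHV01
XXXX01
"""


def render_config(accession: str) -> str:
    """Return YAML text for one transcriptome.

    The keys below are illustrative only; they are not peptigate's
    documented config schema.
    """
    return (
        f"input_dir: inputs/{accession}\n"
        f"output_dir: outputs/tsa_tick_sg_transcriptomes/{accession}\n"
    )


def configs_from_metadata(csv_text: str, column: str = "tsa_accession") -> dict:
    """Map each accession in the metadata table to its config text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column]: render_config(row[column]) for row in reader}


configs = configs_from_metadata(EXAMPLE_METADATA)
for accession in configs:
    # One YAML file per transcriptome.
    print(f"input_configs/tsa_tick_sg_transcriptomes/{accession}.yml")
```

Writing one config per accession keeps each peptigate run independent, so a failure on one transcriptome does not block the others.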
Then, supply this processed data and config files to run peptigate. The peptigate pipeline predicts peptides (sORF and cleavage) from transcriptome assemblies. We run this as a separate step because peptigate is its own Snakemake workflow in a GitHub repository and is not currently installable. We ran peptigate from the commit 148823239aad41a8f03da37f5499b00c8a79de40. We cloned the repository, copied our config and input data files to the repo folder, and ran peptigate with this for loop:
```bash
# Run peptigate once per config file; -k keeps going if one job fails.
for infile in input_configs/tsa_tick_sg_transcriptomes/*yml
do
    snakemake --software-deployment-method conda -j 1 -k --configfile "$infile"
done
```
We then transferred the results back to this repo into the outputs/tsa_tick_sg_transcriptomes folder and analyzed them with the snakefile analyze-peptigate-outputs.snakefile.
```bash
snakemake -s analyze-peptigate-outputs.snakefile --software-deployment-method conda -j 8 -k
```
Finally, we analyzed the results from this last snakefile using the notebooks in the notebooks directory.
To run these notebooks, we use the tidyjupyter environment:
```bash
mamba env create -n tidyjupyter --file envs/tidyjupyter.yml
conda activate tidyjupyter
```
Data
The data analyzed in this repository is recorded in the CSV file tick_sg_transcriptomes_tsa.csv.
We searched the NCBI Transcriptome Shotgun Assembly sequence database for salivary gland transcriptomes from tick species.
The CSV file records the transcriptome accession numbers as well as metadata about the size of the transcriptome assembly.
Overview
This repository records peptide discovery and analysis from publicly available tick salivary gland transcriptomes. As documented above, the analysis proceeds in three parts, beginning with data acquisition, progressing to peptide prediction, and finishing with peptide analysis.
Description of the folder structure
- envs/: Contains conda environment YAML files, both those Snakemake uses for individual rules and those used to run the Snakemake pipelines and notebooks.
- inputs/: Contains input files for the analysis as well as models for tools used in the repository.
- notebooks/: Contains jupyter notebooks that analyze the predicted peptides.
- scripts/: Scripts executed by the snakemake pipelines.
- LICENSE: Details re-use constraints.
- README.md: Documents the project and provides run instructions.
- analyze-peptigate-outputs.snakefile: Documents the steps taken to analyze (annotate and compare) the peptide sequences predicted by peptigate.
- prep-txomes-for-peptigate.snakefile: Documents the steps taken to prepare the TSA transcriptomes for peptigate.
- .github/, .vscode/, .gitignore, .pre-commit-config.yaml, Makefile, pyproject.toml: Snakemake template files that control the developer environment of the repository. See the Arcadia-Science/snakemake-template for more details.
Compute Specifications
- prep-txomes-for-peptigate.snakefile: Ran on a 2021 MacBook Pro with 64 GB of RAM running macOS Ventura 13.4. We executed all commands in a terminal running Rosetta.
- peptigate pipeline: Ran on an AWS EC2 instance of type g4dn.2xlarge running the AMI Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240122 (AMI ID ami-07eb000b3340966b0). Note that the pipeline runs many tools that use GPUs.
- analyze-peptigate-outputs.snakefile: Ran on an AWS EC2 instance of type g4dn.2xlarge running the AMI Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240122 (AMI ID ami-07eb000b3340966b0). Note that the tool AutoPeptideML runs on a GPU.
Notes about prep-txomes-for-peptigate.snakefile
The TSA is notoriously unreliable for downloading files. In theory, one should be able to download transcriptome assemblies and predicted proteins directly from the TSA using NCBI's Entrez-Direct tool; in practice, this approach is spotty due to outages. When using it, we received error messages like the following for a subset of transcriptomes:
$ esearch -db nuccore -query GKHV01 | efetch -format fasta > GKHV01_fasta.fa
WARNING: FAILURE ( Wed Mar 27 18:04:01 UTC 2024 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -query_key 1 -WebEnv MCID_66045f91e16b2b7f5e329221 -retstart 0 -retmax 1 -db nuccore -rettype fasta -retmode text -tool edirect -edirect 21.6 -edirect_os Linux -email ubuntu@ip-172-31-6-147.us-west-1.compute.internal
EMPTY RESULT
SECOND ATTEMPT
WARNING: FAILURE ( Wed Mar 27 18:04:04 UTC 2024 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -query_key 1 -WebEnv MCID_66045f91e16b2b7f5e329221 -retstart 0 -retmax 1 -db nuccore -rettype fasta -retmode text -tool edirect -edirect 21.6 -edirect_os Linux -email ubuntu@ip-172-31-6-147.us-west-1.compute.internal
EMPTY RESULT
LAST ATTEMPT
ERROR: FAILURE ( Wed Mar 27 18:04:06 UTC 2024 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -query_key 1 -WebEnv MCID_66045f91e16b2b7f5e329221 -retstart 0 -retmax 1 -db nuccore -rettype fasta -retmode text -tool edirect -edirect 21.6 -edirect_os Linux -email ubuntu@ip-172-31-6-147.us-west-1.compute.internal
EMPTY RESULT
QUERY FAILURE
We received this error:
- With and without using an NCBI API key
- With the entrez-direct tool installed via NCBI's installation method or via conda
- On a local (macOS) computer and on an AWS EC2 instance
As such, we provide a backup URL from which to download the transcriptome assemblies. This two-step download approach is baked into the Snakefile. Further, while entrez-direct can programmatically download protein predictions in amino acid or nucleotide format when they are available, the TSA itself only provides links for the contigs in nucleotide format or the proteins in amino acid format. Given this, and how spotty entrez-direct coverage of the TSA is, we chose to predict proteins using TransDecoder for all transcriptomes, whether or not they have annotations available. These decisions are documented in docstrings in the Snakefile and implemented in the code itself.
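The two-step download logic can be sketched as a simple ordered fallback. This is an illustration under stated assumptions, not the Snakefile's code: the function name `fetch_with_fallback` and the two stub fetchers are hypothetical, and the stubs stand in for the real Entrez Direct call and the backup URL so the example runs offline.

```python
from typing import Callable, List, Tuple

Fetcher = Callable[[str], str]


def fetch_with_fallback(
    accession: str, fetchers: List[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each named fetcher in order and return (source_name, fasta_text).

    A result counts as a success only if it looks like FASTA, so an empty
    response (like the EMPTY RESULT failures shown earlier) falls through
    to the next source.
    """
    problems = []
    for name, fetch in fetchers:
        try:
            text = fetch(accession)
        except Exception as exc:  # network errors, timeouts, and so on
            problems.append(f"{name}: {exc}")
            continue
        if text.lstrip().startswith(">"):
            return name, text
        problems.append(f"{name}: empty or non-FASTA result")
    raise RuntimeError(f"all sources failed for {accession}: {problems}")


# Hypothetical stand-ins; real fetchers would call Entrez Direct and the
# backup URL respectively.
def entrez_stub(accession: str) -> str:
    return ""  # mimics an empty efetch response


def backup_stub(accession: str) -> str:
    return f">{accession}_contig1\nACGT\n"


source, fasta = fetch_with_fallback(
    "GKHV01", [("entrez", entrez_stub), ("backup", backup_stub)]
)
print(source)
```

Checking that the response starts with a FASTA header, rather than only checking for an exception, matters here because the failures above returned empty results instead of raising a clear error.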
Overview of results
The results covered here are documented in greater detail in the analysis notebooks in the notebooks folder.
Tick salivary gland transcriptomes contain thousands of predicted peptides, many of which have predicted anti-inflammatory or anti-pruritic bioactivity. We predicted peptide sequences from 29 publicly available tick salivary gland transcriptomes as well as the A. americanum (whole body, midgut, and salivary gland) transcriptome assembled in a previous pilot. In total, peptigate predicted 226,538 peptides (17,928 cleavage, 208,610 sORF) from 19 tick species from the genera Amblyomma, Hyalomma, Ixodes, Ornithodoros, and Rhipicephalus.
Peptides with predicted anti-inflammatory bioactivity
See this notebook for more information. We predicted that 6,142 distinct peptide sequences (2,320 cleavage, 3,822 sORF) had anti-inflammatory bioactivity (see this issue for how we predicted anti-inflammatory bioactivity). The machine learning model we used had a 71% accuracy rate. Given this low accuracy rate and the high number of predictions, we struggled to pare this list down and home in on the peptides worth validating experimentally from this data alone.
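One way to see why a 71% accuracy rate makes a long prediction list hard to prioritize is to ask what fraction of predicted positives would be real. Accuracy alone does not determine this. The sketch below assumes, purely for illustration, that the model's sensitivity and specificity both equal 0.71, and it uses hypothetical prevalence values for truly anti-inflammatory peptides; none of these figures come from the analysis itself.

```python
def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Assumption: sensitivity and specificity both equal the reported 71%
# accuracy. The prevalence values are hypothetical, chosen only to show
# the trend.
for prevalence in (0.01, 0.05, 0.20, 0.50):
    ppv = precision(0.71, 0.71, prevalence)
    print(f"prevalence={prevalence:.2f}  precision={ppv:.2f}")
```

Under these assumptions, the rarer true positives are, the larger the share of the prediction list that consists of false positives, which is why additional evidence is needed before choosing candidates for the bench.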
Peptides that are similar to known peptides with anti-pruritic activity
See this notebook for more information.
There are very few peptides with evidence of anti-pruritic effects, so we could not create a machine learning model to identify this bioactivity. Instead, we BLASTp'd our peptide predictions against a database of protein sequences for four peptides with evidence of anti-pruritic activity: calcitonin gene-related peptide, dynorphin, tachykinin-4, and ziconotide. We also BLASTp'd against votucalis, a small tick protein that sequesters histamine; votucalis is not a peptide, as it is longer than 100 amino acids.
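A search like this typically ends with filtering the BLAST hits. The following is a minimal sketch that assumes BLAST's standard 12-column tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The example rows, the peptide IDs, and the e-value cutoff are invented for illustration and are not taken from the notebooks.

```python
from typing import Dict, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    evalue: float
    bitscore: float


# Invented rows in the 12-column tabular layout (tab-separated).
EXAMPLE_TABULAR = """\
pep_001\tCGRP\t62.5\t24\t9\t0\t1\t24\t10\t33\t1e-06\t38.1
pep_001\tdynorphin\t40.0\t15\t9\t0\t3\t17\t1\t15\t0.8\t18.2
pep_002\ttachykinin-4\t55.0\t20\t9\t0\t1\t20\t5\t24\t2e-04\t30.4
pep_003\tziconotide\t35.0\t20\t13\t0\t1\t20\t1\t20\t5.0\t14.0
"""


def best_hits(tabular: str, max_evalue: float = 1e-3) -> Dict[str, Hit]:
    """Keep the highest-bitscore hit per query among hits passing the cutoff.

    The default cutoff is a hypothetical value, not the one used in the
    analysis.
    """
    best: Dict[str, Hit] = {}
    for line in tabular.splitlines():
        if not line.strip():
            continue
        f = line.split("\t")
        hit = Hit(f[0], f[1], float(f[2]), float(f[10]), float(f[11]))
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


hits = best_hits(EXAMPLE_TABULAR)
for query, hit in sorted(hits.items()):
    print(query, hit.subject, hit.evalue)
```

Keeping one best hit per query makes it straightforward to tally how many predicted peptides match each reference peptide.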
We identified 106 peptides (2 cleavage, 104 sORF) from 16 species that had hits to anti-pruritic peptides, the majority of which matched calcitonin gene-related peptide. About 70% of these sequences had hits against the Human Peptide Atlas, indicating that they might share homology (and function) with human peptides. (We assume this fraction is so high because we used BLAST to detect sequences of interest in the first place.) We again clustered all predicted peptides at 80% sequence identity and joined this information to our anti-pruritic peptide predictions; in total, the 106 peptides belonged to 92 clusters, suggesting that we recovered largely independent sequences.
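The join between cluster membership and the hit list amounts to a lookup followed by a count of distinct clusters. The sketch below assumes a two-column, tab-separated cluster table (representative, member), which is the layout some clustering tools such as MMseqs2 emit; the README does not say which tool or format was used, and all IDs here are invented.

```python
import csv
import io

# Invented two-column table: cluster representative, cluster member.
EXAMPLE_CLUSTERS = """\
pep_001\tpep_001
pep_001\tpep_007
pep_002\tpep_002
pep_010\tpep_010
pep_010\tpep_011
"""

# Invented IDs of peptides that had hits to anti-pruritic peptides.
HIT_IDS = ["pep_001", "pep_007", "pep_002", "pep_011"]


def clusters_for(peptide_ids, cluster_tsv: str) -> dict:
    """Return {peptide_id: cluster_representative} for the given peptides."""
    rows = csv.reader(io.StringIO(cluster_tsv), delimiter="\t")
    member_to_rep = {member: rep for rep, member in rows}
    return {pid: member_to_rep[pid] for pid in peptide_ids}


assignments = clusters_for(HIT_IDS, EXAMPLE_CLUSTERS)
n_clusters = len(set(assignments.values()))
print(f"{len(assignments)} peptides in {n_clusters} clusters")
```

When the number of distinct clusters is close to the number of peptides, as with 92 clusters for 106 peptides, few of the hits are near-duplicates of one another.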
Contributing
See how we recognize feedback and contributions to our code.
Owner
- Name: Arcadia Science
- Login: Arcadia-Science
- Kind: organization
- Location: United States of America
- Website: https://www.arcadiascience.com/
- Twitter: ArcadiaScience
- Repositories: 16
- Profile: https://github.com/Arcadia-Science
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite the associated publication.
title: Predicting peptides from tick salivary glands that suppress host detection
doi: 10.57844/arcadia-gfhy-d2f3
authors:
- family-names: Bell
given-names: Audrey
affiliation: Arcadia Science
orcid: https://orcid.org/0009-0008-2270-1613
- family-names: Borges
given-names: Adair L.
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-1477-7908
- family-names: Cheveralls
given-names: Keith
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-4157-6087
- family-names: Chou
given-names: Seemay
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-7271-303X
- family-names: Donnelly
given-names: Justin
affiliation: Arcadia Science
orcid: https://orcid.org/0009-0000-1480-7372
- family-names: Hochstrasser
given-names: Megan L.
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-4404-078X
- family-names: McDaniel
given-names: Elizabeth A.
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0003-4692-0913
- family-names: Reiter
given-names: Taylor
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-7388-421X
- family-names: Weiss
given-names: Emily C.P.
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-9960-0270
preferred-citation:
title: Predicting peptides from tick salivary glands that suppress host detection
type: article
doi: 10.57844/arcadia-gfhy-d2f3
authors:
- family-names: Borges
given-names: Adair L.
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-1477-7908
- family-names: Chou
given-names: Seemay
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-7271-303X
- family-names: Reiter
given-names: Taylor
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-7388-421X
- family-names: Weiss
given-names: Emily C.P.
affiliation: Arcadia Science
orcid: https://orcid.org/0000-0002-9960-0270
year: 2025
GitHub Events
Total
- Release event: 1
- Delete event: 1
- Push event: 2
- Public event: 1
- Pull request event: 1
- Pull request review event: 1
- Create event: 2
Last Year
- Release event: 1
- Delete event: 1
- Push event: 2
- Public event: 1
- Pull request event: 1
- Pull request review event: 1
- Create event: 2
Dependencies
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- snakemake/snakemake-github-action v1.24.0 composite