mspms

R package for the analysis of Multiplex Substrate Profiling by Mass Spectrometry for Proteases (MSP-MS) Data

https://github.com/baynec2/mspms

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: pubmed.ncbi, ncbi.nlm.nih.gov, nature.com
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.5%) to scientific vocabulary

Keywords

protease proteomics proteomics-data-analysis
Last synced: 6 months ago · JSON representation

Repository

R package for the analysis of Multiplex Substrate Profiling by Mass Spectrometry for Proteases (MSP-MS) Data

Basic Info
  • Host: GitHub
  • Owner: baynec2
  • License: other
  • Language: R
  • Default Branch: main
  • Homepage:
  • Size: 53.9 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 3
  • Releases: 1
Topics
protease proteomics proteomics-data-analysis
Created almost 2 years ago · Last pushed 9 months ago
Metadata Files
Readme Changelog License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

ggplot2::theme_set(
  ggplot2::theme_minimal()
)
```



# `r BiocStyle::Biocpkg("mspms")`



[![R-CMD-check](https://github.com/baynec2/mspms/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/baynec2/mspms/actions/workflows/R-CMD-check.yaml)
[![Codecov test
coverage](https://codecov.io/gh/baynec2/mspms/branch/main/graph/badge.svg)](https://app.codecov.io/gh/baynec2/mspms?branch=main)
[![Bioc release status](http://www.bioconductor.org/shields/build/release/bioc/mspms.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/mspms)
[![Bioc devel status](http://www.bioconductor.org/shields/build/devel/bioc/mspms.svg)](https://bioconductor.org/checkResults/devel/bioc-LATEST/mspms)
[![Bioc downloads rank](https://bioconductor.org/shields/downloads/release/mspms.svg)](http://bioconductor.org/packages/stats/bioc/mspms/)
[![Bioc support](https://bioconductor.org/shields/posts/mspms.svg)](https://support.bioconductor.org/tag/mspms)
[![Bioc history](https://bioconductor.org/shields/years-in-bioc/mspms.svg)](https://bioconductor.org/packages/release/bioc/html/mspms.html#since)
[![Bioc last commit](https://bioconductor.org/shields/lastcommit/devel/bioc/mspms.svg)](http://bioconductor.org/checkResults/devel/bioc-LATEST/mspms/)
[![Bioc dependencies](https://bioconductor.org/shields/dependencies/release/mspms.svg)](https://bioconductor.org/packages/release/bioc/html/mspms.html#since)


## Introduction

The goal of `r BiocStyle::Biocpkg("mspms")` is provide a R Package that can be used to easily and
reproducibly analyze data resulting from the [Multiplex Substrate Profiling by
Mass Spectrometry (MSP-MS) method](https://pubmed.ncbi.nlm.nih.gov/36948708/).

Additionally, we provide a [graphical user interface powered by shiny
apps](https://github.com/baynec2/mspms-shiny) that allows for a user to
utilize the method without requiring any R coding knowledge.

Multiplex Substrate Profiling by Mass Spectrometry
[(MSP-MS)](https://pubmed.ncbi.nlm.nih.gov/36948708/) is a powerful method used
to determine the substrate specificity of proteases. This method is of interest
for groups interested in the study of proteases and their role as regulators of
biological pathways, whether applied to the study of disease states, the
development of diagnostic and prognostic tests, generation of tool compounds, or
rational design of protease targeting therapeutics. Analysis of the MS based
data produced by MSP-MS is a multi-step process involving detection and
quantification of peptides, imputation, normalization, cleavage sequence
identification, statistics, and data visualizations. This process can be
challenging, especially for scientists with limited programming experience. To
overcome these issues, we developed the mspms R package to facilitate the
analysis of MSP-MS data utilizing good software/data analysis practices.

mspms differs from existing proteomics packages that are available in the
Bioconductor project in that it is specifically designed to analyze MSP-MS data.
This involves unique preprocessing steps and data visualizations tailored to
this specific assay.

In order to do so, mspms uses several excellent packages from the Bioconductor
project to provide a framework to easily and reproducibly analyze MSP-MS data.
These include:

-   `r BiocStyle::Biocpkg("QFeatures")`: management and processing of peptide 
quantitative features and sample metadata.
    
-   `r BiocStyle::Biocpkg("limma")`: Powerful linear models and 
empirical Bayesian methods for omics data.

-   `r BiocStyle::Biocpkg("MsCoreUtils")`: log2 transformation, 
imputation (QRILC), and normalization (center.median).

Some excellent non-Bioconductor packages are also utilized.

-   *ggplot2*: data visualizations

-   *heatmaply*: interactive heatmaps

-   *rstatix*: conduct statistics

-   *ggseqlogo*: iceLogo motif visualization.

-   *downloadthis*: gives users the ability to download data from standard .html
    report generated by mspms::generate_report().

Lastly, *mspms* takes advantage of iceLogos ([Colaert, N. et al. Nature Methods
6,786-787 (2009)](http://www.ncbi.nlm.nih.gov/pubmed/19876014)) to visualize
over represented amino acid motifs relative to a background set by implementing
components of the Java software in R.

## Installation

You can install the development version from github.

```{r,eval = FALSE}
devtools::install_github("baynec2/mspms")
```

## Quickstart

To generate a general report using your own data, run the following code. It
requires data that has been prepared for mspms data analysis by a converter
function. For more information see subsequent sections.

```{r,eval = FALSE}
mspms::generate_report(
  prepared_data = mspms::peaks_prepared_data,
  outdir = "../Desktop/mspms_report"
)
```

The above command will generate a .html file containing a generic mspms
analysis.

There is much more that can be done using the mspms package- see the following
sections for more information.

## Overview

Functions in this package are logically divided into 4 broad types of functions.

1.  Pre-processing data. These functions are focused on getting all of the data
    necessary for a mspms analysis together and in a consistent format.

2.  Data processing. These functions allow the user to process the proteomics 
data (imputation, normalization, etc).

3.  Statistics. These methods allow the user to perform basic statistics on the 
normalized/processed data.

4.  Data visualization. These functions allow the user to visualize the data.

**Note that only a subset of functions are exported to the user.** Helper functions can be found in the
[preprocessing_helper_functions.R](https://github.com/baynec2/mspms/blob/main/R/preprocessing_helper_functions.R),
[processing_helper_functions.R](https://github.com/baynec2/mspms/blob/main/R/processing_helper_functions.R),[icelogo_helper_functions.R](https://github.com/baynec2/mspms/blob/main/R/iceLogo_helper_fuctions.R), [statistics_helper_functions.R](https://github.com/baynec2/mspms/blob/main/R/statistics_helper_functions.R), and [iceLogo_helper_functions.R](https://github.com/baynec2/mspms/blob/main/R/iceLogo_helper_fuctions.R) files. 

These functions were not intended to be directly used by the user and are not 
exported in an effort to make the package API cleaner and more intuitive.

## Pre-processing data

Pre-processing data here refers to standardizing the data exported from
different software solutions for analyzing mass spectrometry-based proteomics
and adding useful information specific to the MSP-MS assay.

Specifically, the data from the file exported from the upstream software is
parsed to contain the **peptide** (detected peptide sequence, cleavages are
represented as "\_"), **library_id** (name of the peptide in the peptide library
each detected peptide maps to), and intensities for each sample.

Information about each peptide relative to the peptide it was derived from is
then added. This includes the **cleavage_seq** (the motif the peptide a user
specified n residues up and down from the cleavage event), **cleavage_pos**
(position of the peptide library the cleavage product was cleaved at ),
**peptide_type** (whether the detected peptide is a cleavage product of the
peptide library, or a full length member of the peptide library).

All of the information is then loaded as a QFeatures object along with colData
(metadata about each sample) where it is stored as a SummarizedExperiment object
named "peptides".

### A note on colData

As briefly described above, the colData contains metadata describing the samples
you are trying to analyze.

**It is very important that this contains at least the following columns: quantCols","group","condition",and "time".**

The quantCols field contains the name of each sample. These names must match
whatever the samples are named in the files exported from the upstream
proteomics software.

The group field contains text specifying what group each sample belongs to. All
replicate samples should have the same group name

The condition field contains the experimental conditions that vary in your
MSP-MS experiment (this depends on the experimental question, could be type of
protease, the type of inhibitor, etc)

The time field contains the duration of incubation of your protease of interest
with the peptide library for each sample.

An example of a valid colData file is shown below.

```{r}
colData <- readr::read_csv(system.file("extdata/colData.csv",
  package = "mspms"
))

head(colData)
```

### PEAKS

Analysis using a .csv exported from [PEAKS
software](https://www.bioinfor.com/peaks-studio/) is supported.

an exported .csv file can be generated from PEAKS as follows:

1.  Create a New Project in PEAKS.\
2.  Select data to add to PEAKS project.\
3.  Add data, rename samples, set correct parameters. • Highlight all the added
    data.\
    • Select Create new sample for each file.\
    • Select appropriate instrument and fragmentation method.\
    • Set Enzyme to “None”.\
    • Rename samples according to enzymes, time points, and replicates.\
    • Can also ensure appropriate data is selected for these.\
    • Select “Data Refinement” to continue.\
4.  Data Refinement.\
    • On the top right, select MSP-MS as the predefined parameters.\
    • Parameters shown are what should be used for MSP-MS data analysis.\
    • Select “Identification” to continue.\
5.  Peptide Identification.\
    • On the top right, select MSP-MS as the predefined parameters.\
    • Parameters shown are what should be used for MSP-MS data analysis.\
    • Identification -\> Select Database -\> TPD_237.\
    • Identification -\> no PTMs.\
    • Can remove the PTMs by highlighting them and selecting Remove\
6.  Label Free Quantification.\
    • Make sure Label Free is selected.\
    • Group samples -\> add quadruplicates of samples to new group.

```{r}

library(dplyr)
library(mspms)

# File path of peaks lfq file
lfq_filepath <- system.file("extdata/peaks_protein-peptides-lfq.csv",
  package = "mspms"
)

colData_filepath <- system.file("extdata/colData.csv", package = "mspms")

# Prepare the data
peaks_prepared_data <- mspms::prepare_peaks(lfq_filepath,
  colData_filepath,
  quality_threshold = 0.3,
  n_residues = 4
)

peaks_prepared_data

```


### Fragpipe

Data exported from [Fragpipe](https://github.com/Nesvilab/FragPipe) as a .tsv
file is supported.

To generate this file, run a [LFQ Fragpipe
workflow](https://fragpipe.nesvilab.org/docs/tutorial_lfq.html) using your
desired parameters (we recommend using match between runs).

You will find a file named "combined_peptide.tsv" in the output folder from
Fragpipe.

This can be loaded into mspms as follows:

```{r}
combined_peptide_filepath <- system.file("extdata/fragpipe_combined_peptide.tsv",
  package = "mspms"
)

colData_filepath <- system.file("extdata/colData.csv", package = "mspms")


fragpipe_prepared_data <- prepare_fragpipe(
  combined_peptide_filepath,
  colData_filepath
)

fragpipe_prepared_data
```
### Proteome Discoverer

Analysis of a PeptideGroups.txt file exported from [proteome
discoverer](https://www.thermofisher.com/us/en/home/industrial/mass-spectrometry/liquid-chromatography-mass-spectrometry-lc-ms/lc-ms-software/multi-omics-data-analysis/proteome-discoverer-software.html) 
is supported.

To generate this .txt file:
1. Ensure that non-normalized/ non-scaled abundances are included in your 
PeptideGroup table.   
2. File > export > To Text (tab delimited) > Items to be exported - check 
the Peptide Groups box > Export.   

Note that the fileIDs of each sample in proteome discoverer (By default 
F1 - Fn) must match the quantCols in colData.   

```{r}
peptide_groups_filepath <- system.file(
  "extdata/proteome_discoverer_PeptideGroups.txt",
  package = "mspms"
)

colData_filepath <- system.file("extdata/proteome_discover_colData.csv",
  package = "mspms"
)

prepared_pd_data <- prepare_pd(peptide_groups_filepath, colData_filepath)

prepared_pd_data
```
### DIA-NN

We are currently investigating how well the method works with DIA data. 
To allow for this we have created a parser to make DIA-NN output compatible with
mspms.

Note that this could be used with DIA-NN alone or with Fragpipe (MSFragger-DIA) 
uses DIA-NN for quantification. Here we are using the file output from Fragpipe.

Use caution when using this approach, we haven't thoroughly investigated how 
well DIA data works with mspms using ground truth data yet.

```{r}
precursor_filepath = system.file(
  "extdata/diann_report.pr_matrix.tsv",
  package = "mspms"
)
colData_filepath = system.file(
  "extdata/diann_colData.csv",
  package = "mspms"
)

prepared_diann_data <- prepare_diann(precursor_filepath,colData_filepath)

prepared_diann_data
```


## Data Processing

Data processing includes imputation and normalization of the proteomics data.
This is all done in a QFeatures object using MSCoreUtils methods. Data from 
each step is stored as a SummarizedExperiment within the QFeatures object with
the indicated name. 

1. Data is first log2 transformed ("peptides_log").   
2. Data is normalized using the center.median method ("peptides_log_norm").  
3. Missing values are then imputed using the QRILC method 
("peptides_log_impute_norm"). 
4. Data is then reverse log2 transformed ("peptides_norm"). 

```{r}
set.seed(2)
processed_qf <- process_qf(peaks_prepared_data)
processed_qf
```

## Statistics

MSP-MS analysis has traditionally used T-tests to what peptides are
significantly different relative to T0. The authors of the assay suggest using a
T-test derived FDR adjusted p value threshold \<= 0.05 and log2 fold change
threshold \>= 3. 

We recommend that the user now uses a Limma based approach that uses a linear 
modeling and empirical Bayesian approach. This is a approach well suited to 
proteomics data that is powerful, flexible, and fast.

This limma based approach (with stats calculated relative to time 0 for each 
condition) can be utilized as follows:

```{r}
limma_stats <- mspms::limma_stats(processed_qf = processed_qf)
```

Alternatively, we provide a user-friendly implementation of the traditional
t-test approach as follows:

```{r}
log2fc_t_test_data <- mspms::log2fc_t_test(processed_qf = processed_qf)
```


## Data Visualizations

We provide a number of data visualizations as part of the mspms package.

### QC Checks

A critical initial step in MSP-MS data analysis is a rigorous quality control
assessment. Given the reliance on a specific peptide library, calculating the
percentage of library peptides detected in each sample provides a
straightforward yet powerful indicator of data integrity. Anomalously low
detection rates signal potential instrument or experimental problems.

We can see the percentage of the peptide library that were undetected among each
group of replicates as follows (for full length peptides a threshold of \~\< 10
is considered good while for cleavage products a threshold \~\< 5 is good):

```{r}
plot_qc_check(processed_qf,
  full_length_threshold = 10,
  cleavage_product_threshold = 5
)
```

We can also see in what percentage of samples each peptide in the peptide
library were undetected in.

```{r}
plot_nd_peptides(processed_qf)
```

### Tidying our data

We can then convert our QFeatures object into a "long" tibble using the 
mspms_tidy function.

This makes the data easy to manipulate prior to ggplot2 or plotly based plotting
functions. 

```{r}
mspms_tidy_data <- mspms_tidy(processed_qf)
```

### Heatmap

We can inspect our data using an interactive heatmap as follows:

```{r,eval = FALSE}
plot_heatmap(mspms_tidy_data)
```

note that the interactive plot is not returned in this vignette 
due to size constraints. 

### PCA

We can also inspect our data using a PCA as follows.

```{r}
plot_pca(mspms_tidy_data, value_colname = "peptides_norm")
```

### Volcano plots


We can generate volcano plots showing the significant peptides from each
condition as follows:

```{r}
plot_volcano(log2fc_t_test_data)
```

### Cleavage Position Plots

We can also create plots showing the number of times we observe a cleavage at
each position of the peptide_library among a set of peptides of interest.

We can do this as follows:

```{r}
sig_cleavage_data <- log2fc_t_test_data %>%
  dplyr::filter(p.adj <= 0.05, log2fc > 3)

p1 <- mspms::plot_cleavages_per_pos(sig_cleavage_data)

p1
```

### iceLogo

[Icelogos](https://www.nature.com/articles/nmeth1109-786) are a method for
visualizing conserved patterns in protein sequences through probability theory.

As part of the mspms R package, we have implemented the logic to generate
[iceLogos](https://genesis.ugent.be/uvpublicdata/icelogo/iceLogo.pdf) in R.

To use this, we first have to define a vector of cleavage sequences that we are
interested in. We will do this for the significant CatA cleavages below.

```{r}
catA_sig_cleavages <- log2fc_t_test_data %>%
  dplyr::filter(p.adj <= 0.05, log2fc > 3) %>%
  dplyr::filter(condition == "CatA") %>%
  dplyr::pull(cleavage_seq) %>%
  unique()
```

We also need to know all of the possible cleavages we could see among our
peptide library. We can generate that as below:

```{r}
all_possible_8mers_from_228_library <- calculate_all_cleavages(
  mspms::peptide_library$library_real_sequence,
  n_AA_after_cleavage = 4
)
```

We can then generate an iceLogo relative to the background of all possible
cleavages in our peptide library like so:

```{r}
plot_icelogo(catA_sig_cleavages,
  background_universe = all_possible_8mers_from_228_library
)
```

We also provide a function that conveniently plots all icelogos from all
conditions from the experiment all together.

```{r}
sig_cleavage_data <- log2fc_t_test_data %>%
  dplyr::filter(p.adj <= 0.05, log2fc > 3)

plot_all_icelogos(sig_cleavage_data)
```

### Plotting a Time Course

We can also plot the intensities of peptides of interest over time.

First we need to define which peptides we are interested in. Here we will use
the t-test results, and look at the peptide with the max log2fc.

```{r}
max_log2fc_pep <- log2fc_t_test_data %>%
  dplyr::filter(p.adj <= 0.05, log2fc > 3) %>%
  dplyr::filter(log2fc == max(log2fc)) %>%
  pull(peptide)
```

We can then filter our tidy mspms data to only include this peptide and plot

```{r}
p1 <- mspms_tidy_data %>%
  dplyr::filter(peptide == max_log2fc_pep) %>%
  plot_time_course()

p1
```


```{r}
sessionInfo()
```

Owner

  • Name: Charlie Bayne
  • Login: baynec2
  • Kind: user
  • Location: La Jolla CA
  • Company: UCSD

Gonzalez Lab - UCSD - Biomedical Sciences Ph.D. Program

GitHub Events

Total
  • Create event: 3
  • Release event: 3
  • Issues event: 8
  • Issue comment event: 5
  • Push event: 14
Last Year
  • Create event: 3
  • Release event: 3
  • Issues event: 8
  • Issue comment event: 5
  • Push event: 14

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 148
  • Total Committers: 2
  • Avg Commits per committer: 74.0
  • Development Distribution Score (DDS): 0.007
Past Year
  • Commits: 56
  • Committers: 1
  • Avg Commits per committer: 56.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Charlie Bayne b****2@g****m 147
Charlie Bayne 4****2@u****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 29
  • Total pull requests: 0
  • Average time to close issues: 14 days
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 1.14
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 12
  • Pull requests: 0
  • Average time to close issues: 19 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 1.08
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • baynec2 (34)
Pull Request Authors
Top Labels
Issue Labels
enhancement (7) bug (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • bioconductor 1,805 total
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 3
  • Total maintainers: 1
proxy.golang.org: github.com/baynec2/mspms
  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.9%
Average: 6.1%
Dependent repos count: 6.3%
Last synced: 6 months ago
bioconductor.org: mspms

Tools for the analysis of MSP-MS data

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 1,805 Total
Rankings
Dependent repos count: 0.0%
Dependent packages count: 30.3%
Average: 40.6%
Downloads: 91.5%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 2.10 depends
  • MASS * imports
  • NormalyzerDE * imports
  • dplyr * imports
  • magrittr * imports
  • outliers * imports
  • purrr * imports
  • tidyr * imports
  • truncnorm * imports
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v4 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v4 composite
  • actions/upload-artifact v4 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite