https://github.com/const-ae/prodd_old

Differential Detection for Label-free (LFQ) Mass Spec Data

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (13.7%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: const-ae
Language: C++
Default Branch: master
Homepage:
Size: 7.07 MB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created over 8 years ago · Last pushed over 8 years ago

Metadata Files

Readme

README.Rmd

---
title: "proDD"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE,
  fig.path = "tools/README-fig/",
  cache.path = "tools/README-cache/",
  message = FALSE,
  warning = FALSE
)
```

Differential Detection with Label-free Mass Spec Data

## Overview

This package provides a framework to find proteins in mass spec data that are differentially detected between groups.
It is designed to deal with a high number of missing values (i.e. zeros) and can nonetheless give reliable significance
estimates.

It is thus applicable to data from affinity purification experiments such as BioID.

## Method

The algorithm is built around the fact that missing values are more likely to occur if the intensity of a protein
is low, which means that a missing observation carries information. In the first step, the algorithm quantifies this
dependency by fitting a logistic regression of the probability of missing a value on the underlying intensity. This model
is fitted using Hamiltonian Monte Carlo, because precise estimates of the sigmoid are necessary for reliable
downstream calculations. In the second step, the group means for each condition and protein are estimated using a
maximum likelihood approach. In the last step, a moderated t-test is applied to each protein to find which proteins
differ significantly between groups.
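
To make the first step concrete, the following is a minimal sketch of a logistic missingness curve. It is not taken from the package: `p_miss` is a hypothetical helper, the parametrization `(x - location) / scale` is an assumption, and the values `location = 8` and `scale = -0.3` are borrowed from the simulation call further down purely for illustration.

```r
# Hypothetical helper (not part of proDD): probability that a value with
# log-intensity x is missing, assuming a logistic curve. A negative scale
# makes the probability fall as the intensity rises.
p_miss <- function(x, location = 8, scale = -0.3) {
  plogis((x - location) / scale)
}

intensities <- c(6, 7, 8, 9, 10)
miss_prob <- p_miss(intensities)

# Low intensities are almost always missing, high intensities almost never
data.frame(intensity = intensities, p_missing = round(miss_prob, 3))
```

Under these assumptions the curve crosses 0.5 exactly at `location`, and `scale` controls how sharply the dropout probability changes around that point.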

Unlike other approaches suggested in the literature that impute missing values with _ad hoc_ methods, such as
using half the global minimum, proDD exploits the information provided by the zeros in a
structured way and focuses on the MLE of the group means, which are sufficient to establish significance.
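
The idea of letting missing values inform the group mean can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `p_miss` and `neg_log_lik` are hypothetical helpers, the missingness curve is assumed to be logistic with illustrative parameters, and the standard deviation `sigma` is treated as known.

```r
# Assumed logistic missingness curve; parameter values are illustrative only.
p_miss <- function(x, location = 8, scale = -0.3) {
  plogis((x - location) / scale)
}

# Negative log-likelihood of a group mean `mu`, given the observed intensities
# and the number of missing replicates. `sigma` is treated as known.
neg_log_lik <- function(mu, observed, n_missing, sigma) {
  ll_observed <- sum(dnorm(observed, mean = mu, sd = sigma, log = TRUE))
  # Marginal probability that a replicate drawn from N(mu, sigma^2) is missing
  prob_missing <- integrate(
    function(x) dnorm(x, mean = mu, sd = sigma) * p_miss(x),
    lower = mu - 8 * sigma, upper = mu + 8 * sigma
  )$value
  -(ll_observed + n_missing * log(max(prob_missing, .Machine$double.xmin)))
}

observed <- c(8.4, 8.6)  # two detected replicates, one replicate missing
mu_hat <- optimize(neg_log_lik, interval = c(0, 15),
                   observed = observed, n_missing = 1, sigma = 0.4)$minimum
mu_naive <- mean(observed)  # ignores the missing replicate entirely

c(mle = mu_hat, naive = mu_naive)
```

Because a missing replicate is more plausible at low intensity, the estimate that accounts for it lies below the plain mean of the observed values, without any value having to be imputed.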

## Workflow

Installation

```{r}
# Install directly from GitHub (requires the devtools package)
devtools::install_github("const-ae/proDD")
```


Let's assume that `X` is a matrix where each row contains the intensities of one protein and each column is one
sample. The samples can be grouped into conditions.

```{r, echo=FALSE}
library(proDD)
source("tests/testthat/helper_datageneration.R")
data <- generate_zero_inflated_data_with_effect(N_genes=100, N_rep=3, perc_changed = 0, mu0=8.5, nu0=5, sigma0=0.4, location=8, scale=-0.3)
X <- cbind(data$X, data$Y)
colnames(X) <- c(paste0("A_", 1:3), paste0("B_", 1:3))
X <- X[rowSums(X) != 0, ]
```

```{r}
library(proDD)
head(X, n=10)
```

```{r, echo=FALSE}
ComplexHeatmap::Heatmap((X != 0)*1.0, cluster_rows=FALSE, cluster_columns= FALSE,
                        col=c("black", "lightgrey"), name="Value Observed")
```


For the subsequent steps, a description of the samples is necessary, i.e. which sample belongs to which condition.
For this we will create a data frame containing that information:

```{r}
data_description <-  data.frame(Condition=as.factor(c(rep("A", 3), rep("B", 3))), 
                                Replicate=c(1:3, 1:3))
data_description$Sample <- paste0(data_description$Condition, data_description$Replicate)
data_description

design <-  model.matrix(Sample ~ Condition - 1, data_description)
design
```

Now we can apply the algorithm, which consists of four steps, to that data:

1. Estimate the parameters for the variance moderation:

    ```{r}
    vm_est <- estimate_variance_moderation(X, design)
    ```

2. Estimate the sigmoid that describes the chance to miss an observation:

    ```{r}
    sig_est <- estimate_sigmoid(X, data_description, vm_est$nu_est, vm_est$sigma2_est, chains=1)
    ```

3. Estimate the means of each condition per protein:

    ```{r}
    group_locations <- estimate_group_means(X, design, vm_est$nu_est, vm_est$sigma2_est, sig_est$location_est, sig_est$scale_est)
    ```

4. Lastly, apply the moderated t-test to the group means to find differentially detected proteins:

    ```{r}
    result <- detect_differences(X, design, data_description, d0=vm_est$nu_est, s0=vm_est$sigma2_est,
                                 group_locations=group_locations, comparison=c("A", "B"))
    
    head(result, n=10)
    ```
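
For readers unfamiliar with variance moderation, the last step can be sketched as a limma-style moderated t-test for two groups. The function below is a hypothetical illustration, not the code behind `detect_differences`: it assumes that `d0` plays the role of the prior degrees of freedom and `s0_sq` that of the prior variance, and the input numbers are made up.

```r
# Hypothetical helper (not part of proDD): two-group moderated t-test that
# shrinks the per-protein variance `s2` (with `df` residual degrees of
# freedom) towards a prior variance `s0_sq` with `d0` prior degrees of freedom.
moderated_t <- function(mean_a, mean_b, s2, df, n_a, n_b, d0, s0_sq) {
  s2_post <- (d0 * s0_sq + df * s2) / (d0 + df)  # posterior (moderated) variance
  t_stat <- (mean_a - mean_b) / sqrt(s2_post * (1 / n_a + 1 / n_b))
  df_total <- d0 + df
  p_value <- 2 * pt(-abs(t_stat), df = df_total)  # two-sided p-value
  list(t = t_stat, df = df_total, p_value = p_value)
}

# Made-up intensities for one protein in two conditions
a <- c(8.1, 8.4, 8.6)
b <- c(9.0, 9.3, 9.5)
pooled_var <- (2 * var(a) + 2 * var(b)) / 4

res <- moderated_t(mean(a), mean(b), pooled_var, df = 4,
                   n_a = 3, n_b = 3, d0 = 5, s0_sq = 0.16)
res
```

With `d0 = 0` the sketch reduces to the ordinary pooled-variance t-test, which makes the role of the prior easy to see: larger `d0` pulls each protein's variance more strongly towards `s0_sq` and adds degrees of freedom.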



## Note

This project is still a work in progress. Although the algorithm is working well, the API will probably change dramatically.

Owner

Name: Constantin
Login: const-ae
Kind: user
Location: Heidelberg, Germany
Company: EMBL

Website: https://twitter.com/const_ae
Repositories: 64
Profile: https://github.com/const-ae

PhD Student, Biostats, R

Committers

Last synced: about 1 year ago

All Time

Total Commits: 21
Total Committers: 1
Avg Commits per committer: 21.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
const-ae	a**5@g**m	21

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Dependencies

DESCRIPTION cran

R >= 3.0.2 depends
Rcpp >= 0.12.11 depends
methods * depends
limma * imports
maxLik * imports
rstan >= 2.16.2 imports
rstantools >= 1.2.0 imports
testthat * suggests
