aric

Accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data

https://github.com/xwanglabthu/aric

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: nature.com
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Keywords

deconvolution methylation rna-seq
Last synced: 6 months ago

Repository


Basic Info
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 3
  • Open Issues: 0
  • Releases: 1
Topics
deconvolution methylation rna-seq
Created over 4 years ago · Last pushed almost 2 years ago

README.md

ARIC

Section 1: Introduction

ARIC is a bioinformatics tool for deconvolving bulk gene expression and DNA methylation data. ARIC uses a novel two-step marker selection strategy: component-wise condition number-based elimination of collinear features, followed by adaptive removal of outlier markers. This strategy systematically selects effective markers, which supports robust and precise weighted υ-SVR-based prediction of proportions, including those of rare cell types.

Section 2: Installation Tutorial

Section 2.1: System requirement

ARIC is implemented in Python and can be installed on Windows, UNIX/Linux and macOS. ARIC requires Python 3 or later, and all dependencies are installed automatically by pip.

Section 2.2: Installation

ARIC can be installed from PyPI with the following command. The source code can also be downloaded from PyPI.

Shell pip install ARIC

Section 3: A Quick Tutorial for Demo data Deconvolution

In this section, we will demonstrate how to perform bulk data deconvolution using the demo data.

Section 3.1: Quick Start

We provide a small demo dataset here. It consists of two main files in CSV format: one holds the bulk mixture data and the other holds the external reference data. Pass the two file paths to the function "ARIC" and the program will do everything else.

```python
from ARIC import *

ARIC(mix_path="mix.csv", ref_path="ref.csv")
```
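To experiment without the demo download, a pair of tiny input files can be generated by hand. The sketch below is only an illustration: the layout (markers as rows, samples or cell types as columns, with row and column names) is an assumption based on the requirement that both files carry row names and column names, and the gene names, cell type names and proportions are made up.

```python
import csv
import os
import random


def write_demo_inputs(folder, n_markers=6, seed=0):
    """Write a toy ref.csv and mix.csv into `folder` and return their paths.

    Assumed layout (not confirmed by the ARIC documentation):
    rows are markers, columns are cell types (reference) or samples (mixture).
    """
    rng = random.Random(seed)
    cell_types = ["CellTypeA", "CellTypeB", "CellTypeC"]  # hypothetical names
    markers = [f"Gene{i + 1}" for i in range(n_markers)]  # hypothetical names
    # Made-up mixing proportions for two samples; each list sums to 1.
    true_props = {"Sample1": [0.6, 0.3, 0.1], "Sample2": [0.2, 0.2, 0.6]}
    samples = list(true_props)

    # Random reference profile for every marker.
    ref = {m: [round(rng.uniform(1, 100), 3) for _ in cell_types] for m in markers}

    ref_path = os.path.join(folder, "ref.csv")
    mix_path = os.path.join(folder, "mix.csv")

    with open(ref_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([""] + cell_types)
        for m in markers:
            writer.writerow([m] + ref[m])

    with open(mix_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([""] + samples)
        for m in markers:
            # Linear mixing model: bulk signal = sum of reference * proportion.
            row = [
                round(sum(r * p for r, p in zip(ref[m], true_props[s])), 3)
                for s in samples
            ]
            writer.writerow([m] + row)

    return mix_path, ref_path
```

The two returned paths can then be passed to `ARIC` as `mix_path` and `ref_path`. Six markers are far too few for a meaningful deconvolution; the files only show the expected shape of the inputs.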

Section 3.2: Function Introduction

The main function in ARIC is decipher.

Python ARIC(mix_path, ref_path, save_path=None, marker_path=None, selected_marker=False, scale=0.1, delcol_factor=10, iter_num=10, confidence=0.75, w_thresh=10, unknown=False, is_methylation=False)

  • 'mix_path': Path to the mixture data. Must be a CSV file with column names and row names.
  • 'ref_path': Path to the reference data. Must be a CSV file with column names and row names.
  • 'save_path': Where to save the deconvolution results. Default: mixpathprefix_prop.csv.
  • 'marker_path': Path to the user-specified markers. Must be a CSV file.
  • 'selected_marker': Whether to output the selected markers for every sample. Marker files will be saved in a folder named "sample_marker.csv".
  • 'scale': Controls the convergence of the SVR. A smaller value makes convergence much faster. Default: 0.1.
  • 'delcol_factor': Controls how aggressively collinearity is removed. Default: 10.
  • 'iter_num': Number of outlier detection iterations. Default: 10.
  • 'confidence': Fraction of markers retained in each outlier detection loop. Default: 0.75.
  • 'w_thresh': Threshold used to truncate the weights. Default: 10.
  • 'unknown': Whether to estimate the proportion of unknown content.
  • 'is_methylation': Whether the input is methylation data. If true, a preliminary marker selection will be performed.
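The roles of 'iter_num' and 'confidence' can be illustrated with a toy outlier-trimming loop. This is a minimal sketch and not ARIC's implementation: it swaps the weighted υ-SVR for ordinary least squares on two cell types, reads 'confidence' as the fraction of markers kept per loop (following the description above), and adds a hypothetical `min_markers` floor so that the marker set never shrinks to nothing.

```python
def fit_proportion(mix, ref_a, ref_b, keep):
    """Least-squares estimate of p in mix = p*ref_a + (1-p)*ref_b over `keep`."""
    num = sum((mix[i] - ref_b[i]) * (ref_a[i] - ref_b[i]) for i in keep)
    den = sum((ref_a[i] - ref_b[i]) ** 2 for i in keep)
    return num / den


def trim_outlier_markers(mix, ref_a, ref_b, iter_num=3, confidence=0.75,
                         min_markers=5):
    """Refit repeatedly, each time keeping the best-fitting fraction of markers."""
    keep = list(range(len(mix)))
    p = fit_proportion(mix, ref_a, ref_b, keep)
    for _ in range(iter_num):
        n_keep = max(min_markers, int(confidence * len(keep)))
        if n_keep >= len(keep):
            break

        def residual(i, p=p):
            # Absolute error of marker i under the current proportion estimate.
            return abs(mix[i] - (p * ref_a[i] + (1 - p) * ref_b[i]))

        keep = sorted(sorted(keep, key=residual)[:n_keep])
        p = fit_proportion(mix, ref_a, ref_b, keep)
    return p, keep


# Toy data: 20 markers, true proportion 0.3, two corrupted markers.
ref_a = [5.0 + (i + 1) for i in range(20)]
ref_b = [5.0] * 20
true_p = 0.3
mix = [true_p * a + (1 - true_p) * b for a, b in zip(ref_a, ref_b)]
mix[0] += 50.0
mix[1] += 50.0

p_all = fit_proportion(mix, ref_a, ref_b, range(20))
p_trimmed, kept = trim_outlier_markers(mix, ref_a, ref_b)
print(f"all markers: {p_all:.4f}, after trimming: {p_trimmed:.4f}")
```

In this toy setting the two corrupted markers bias the estimate when all markers are used, and they are discarded in the first trimming round because their residuals are by far the largest.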

Section 4: Applications on TCGA Ovarian Cancer

In this part, we demonstrate how to use ARIC to classify ovarian cancer patients. Users can follow the instructions below to reproduce the results in our article.

Ovarian cancer patient data with survival information can be downloaded directly from LinkedOmics. LM22 reference data can be downloaded from CIBERSORT. The survival information is stored in the file "Human__TCGA_OV__MS__Clinical__Clinical__01_28_2016__BI__Clinical__Firehose.tsi".

We provide the scaled data and survival information here.

Section 4.1: Deconvolution for All Patients

First, put "mix_scaled.csv" and "ref_scaled.csv" in your working folder.

```Python
from ARIC import *

ARIC(mix_path="mix_scaled.csv", ref_path="ref_scaled.csv", save_path="ov_ARIC.csv", selected_marker=True)
```

Then wait for the deconvolution to finish.

```Python
--------------WELCOME TO ARIC----------------

Data reading finished!
ARIC Engines Start, Please Wait......
100%|█████████████████████████████████████████████████████████████| 514/514 [01:14<00:00, 6.89it/s]
Deconvo Results Saving!
Finished!
```

There are two main outputs. The first is the estimated proportion file, "ov_ARIC.csv". The second is a folder named "mix_scaled" (the same name as the input mixture file), which holds the markers selected by ARIC for each sample.

Section 4.2: Survival Analysis

Then, we perform survival analysis using the R packages "survival" and "survminer".

```R
library(survival)
library(survminer)
library(tidyr)
library(gridExtra)

# import survival information
sur_info <- read.table(file = "Human__TCGA_OV__MS__Clinical__Clinical__01_28_2016__BI__Clinical__Firehose.tsi", header = TRUE, row.names = 1)
tmp_rowname <- rownames(sur_info)

# import the proportions estimated by ARIC (cell types in rows, patients in columns)
data <- read.csv(file = "ov_ARIC.csv", header = TRUE, row.names = 1)

selected_celltype <- c("T.cells.CD8", "T.cells.gamma.delta", "Macrophages.M1", "NK.cells.resting", "NK.cells.activated")

# total proportion of the selected cell types per patient, split at the median
data <- data[selected_celltype, ]
data <- colSums(x = data)
prop_median <- median(data)

high_risk <- names(data)[which(data <= prop_median)]
low_risk <- names(data)[which(data > prop_median)]

label <- rep(x = "tumor", times = ncol(sur_info))
names(label) <- colnames(sur_info)
idx_high <- which(names(label) %in% high_risk)
label[idx_high] <- "high"
idx_low <- which(names(label) %in% low_risk)
label[idx_low] <- "low"

sur_info <- rbind(sur_info, label)
rownames(sur_info) <- c(tmp_rowname, "risk")

# keep only patients with estimated proportions, one patient per row
sur_info <- as.data.frame(t(sur_info[, which(colnames(sur_info) %in% names(data))]))

sur_info <- drop_na(data = sur_info, c("overall_survival", "status"))
sur_info <- transform(sur_info, overall_survival = as.numeric(overall_survival))
sur_info <- transform(sur_info, status = as.numeric(status))

fit <- survfit(Surv(overall_survival, status) ~ risk, data = sur_info)
ggsurvplot(fit, pval = TRUE, conf.int = TRUE)

res_cox <- coxph(Surv(overall_survival, status) ~ risk, data = sur_info)

summary(res_cox)$conf.int
```

This produces the survival curve and hazard ratio shown below.

            exp(coef) exp(-coef) lower .95 upper .95
    risklow 0.7424766   1.346844  0.593249 0.9292413


ARIC Predicted OV patients' survival curve
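As a quick sanity check on the hazard ratio table above, the reported numbers can be related to each other with a few lines of arithmetic. The values are copied from the table; the variable names are only illustrative. The check relies on two standard properties of a Cox model summary: `exp(-coef)` is the reciprocal of `exp(coef)`, and the 95% interval is symmetric on the log scale, so the geometric mean of its bounds should reproduce the hazard ratio.

```python
import math

# Values copied from the "risklow" row of the coxph summary above.
hazard_ratio = 0.7424766   # exp(coef): low-risk group relative to high-risk group
inverse_ratio = 1.346844   # exp(-coef)
ci_lower, ci_upper = 0.593249, 0.9292413

# exp(-coef) is simply 1 / exp(coef).
reciprocal = 1.0 / hazard_ratio

# The interval is exp(coef +/- z * se), so it is symmetric around coef on the
# log scale and the geometric mean of the bounds recovers exp(coef).
geometric_mid = math.sqrt(ci_lower * ci_upper)

# Relative reduction in hazard for the low-risk group.
risk_reduction = 1.0 - hazard_ratio

print(f"1 / HR            = {reciprocal:.6f}")
print(f"geometric CI mid  = {geometric_mid:.6f}")
print(f"hazard reduction  = {risk_reduction:.1%}")
```

Because the upper bound of the interval is below 1, the lower hazard in the low-risk group is significant at the 5% level under this model.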


Section 5: Computational Efficiency Comparison

Computational efficiency is largely influenced by the number of markers. Therefore, we compared computation time across different methods and different numbers of markers.

We generated in silico mixed gene expression data with different numbers of markers (100, 500, 1000, 2000, 5000, 7000 and 10000). To obtain reliable results, for each marker number we generated 10 datasets of 50 samples each. We compared the mean computation time for 50 samples and summarized the results in the following table.


Computational Efficiency Comparison


ARIC needs to recompute the component-wise condition number after removing each collinear marker, so it takes longer than matrix operation-based methods such as dtangle and DeconRNASeq. The computational efficiencies of ARIC, EPIC and FARDEEP are at the same level. In addition, the computation time of CIBERSORT grows drastically as the number of markers increases. We therefore strongly recommend filtering out low-quality markers before deconvolution.
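The benchmark design described in this section (several simulated datasets per marker number, mean time per dataset) can be sketched with the standard library alone. This is a hedged illustration rather than the script used for the comparison: `simulate_mixture`, `time_method` and `placeholder_method` are hypothetical names, the reference values are uniform random numbers, and the placeholder stands in for whichever deconvolution method is being timed.

```python
import random
import time
from statistics import mean


def simulate_mixture(n_markers, n_celltypes, n_samples, rng):
    """Build a random reference and mix it with random proportions."""
    ref = [[rng.uniform(0, 10) for _ in range(n_celltypes)]
           for _ in range(n_markers)]
    props = []
    for _ in range(n_samples):
        raw = [rng.random() for _ in range(n_celltypes)]
        total = sum(raw)
        props.append([x / total for x in raw])  # proportions sum to 1
    # Linear mixing: each bulk value is a proportion-weighted sum of the reference.
    mix = [[sum(ref[m][c] * props[s][c] for c in range(n_celltypes))
            for s in range(n_samples)]
           for m in range(n_markers)]
    return ref, mix, props


def placeholder_method(ref, mix):
    """Stand-in workload; replace with a call to a real deconvolution method."""
    return [sum(column) for column in zip(*mix)]


def time_method(method, marker_numbers, n_datasets=10, n_samples=50,
                n_celltypes=5, seed=0):
    """Return the mean wall-clock time per dataset for each marker number."""
    rng = random.Random(seed)
    timings = {}
    for n_markers in marker_numbers:
        runs = []
        for _ in range(n_datasets):
            ref, mix, _ = simulate_mixture(n_markers, n_celltypes, n_samples, rng)
            start = time.perf_counter()
            method(ref, mix)
            runs.append(time.perf_counter() - start)
        timings[n_markers] = mean(runs)
    return timings
```

For example, `time_method(placeholder_method, [100, 500, 1000])` returns a dictionary mapping each marker number to the mean time in seconds. Data generation is kept outside the timed region so that only the method itself is measured.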

Citation

Zhang, Wei, Hanwen Xu, Rong Qiao, Bixi Zhong, Xianglin Zhang, Jin Gu, Xuegong Zhang, Lei Wei, and Xiaowo Wang. "ARIC: accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data." Briefings in Bioinformatics 23, no. 1 (2022): bbab362.

Owner

  • Name: XWangLabTHU
  • Login: XWangLabTHU
  • Kind: organization

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 18
  • Total Committers: 2
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.111
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
ZweiTHU s****2@g****m 16
Honchkrow zw@s****n 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: 3 days
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 1.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ryrl9703 (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 10 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
pypi.org: aric

ARIC: Accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 10 Last month
Rankings
Dependent packages count: 10.0%
Forks count: 15.3%
Stargazers count: 31.9%
Average: 40.4%
Dependent repos count: 67.6%
Downloads: 77.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • numpy *