aric
Accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data
Science Score: 46.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: nature.com
- ✓ Committers with academic emails: 1 of 2 committers (50.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: XWangLabTHU
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Homepage: https://xwanglabthu.github.io/ARIC/
- Size: 820 KB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 3
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
ARIC
- Section 1: Introduction
- Section 2: Installation Tutorial
- Section 3: A Quick Tutorial for Demo data Deconvolution
- Section 4: Applications on TCGA Ovarian Cancer
- Section 5: Computational Efficiency Comparison
- Citation
Section 1: Introduction
ARIC is a bioinformatics software package for deconvolving bulk gene expression and DNA methylation data. It uses a novel two-step marker selection strategy: collinear features are first eliminated using the component-wise condition number, and outlier markers are then removed adaptively. This strategy systematically yields effective markers that ensure robust and precise weighted ν-SVR-based prediction of rare proportions.
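To give a feel for why collinear markers are a problem, the sketch below computes a condition number for two reference profiles. It is only an illustration: it uses the ordinary 2-norm condition number of a two-column matrix rather than the component-wise variant that ARIC relies on, and the profile names and values are hypothetical.

```python
import math

def condition_number_two_columns(col_a, col_b):
    """2-norm condition number of the n x 2 matrix [col_a, col_b]."""
    # Gram matrix G = A^T A is 2 x 2 and symmetric.
    g11 = sum(a * a for a in col_a)
    g22 = sum(b * b for b in col_b)
    g12 = sum(a * b for a, b in zip(col_a, col_b))
    # Closed-form eigenvalues of a symmetric 2 x 2 matrix.
    mean = (g11 + g22) / 2.0
    radius = math.sqrt(((g11 - g22) / 2.0) ** 2 + g12 ** 2)
    lam_max, lam_min = mean + radius, mean - radius
    if lam_min <= 0.0:
        return math.inf  # columns are exactly collinear
    # Singular values of A are the square roots of the eigenvalues of G.
    return math.sqrt(lam_max / lam_min)

# Hypothetical reference profiles over four markers.
t_cell = [10.0, 1.0, 8.0, 2.0]
b_cell = [1.0, 9.0, 2.0, 7.0]        # clearly distinct from t_cell
t_cell_like = [10.1, 1.0, 7.9, 2.1]  # nearly collinear with t_cell

well_conditioned = condition_number_two_columns(t_cell, b_cell)
ill_conditioned = condition_number_two_columns(t_cell, t_cell_like)
print(well_conditioned, ill_conditioned)
```

The nearly collinear pair gives a much larger value, which is the kind of signal a collinearity filter can act on.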
Section 2: Installation Tutorial
Section 2.1: System requirement
ARIC is implemented in Python and can be installed on Windows, UNIX/Linux, and macOS. It requires Python 3 or later; all dependencies are installed automatically by pip.
Section 2.2: Installation
ARIC can be installed from PyPI with the following command. The source code can also be downloaded from PyPI.
```shell
pip install ARIC
```
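After installing, a quick check along the following lines can confirm that the interpreter meets the version requirement and that the package is importable. The helper name `aric_ready` is hypothetical and not part of ARIC; the import name `ARIC` is taken from the tutorial's `from ARIC import *`.

```python
import importlib.util
import sys

def aric_ready(min_python=(3, 0)):
    """Return True if the interpreter is new enough and ARIC is importable."""
    if sys.version_info < min_python:
        return False
    # find_spec returns None when the top-level package is not installed.
    return importlib.util.find_spec("ARIC") is not None

print("ARIC ready:", aric_ready())
```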
Section 3: A Quick Tutorial for Demo data Deconvolution
In this section, we will demonstrate how to perform bulk data deconvolution using the demo data.
Section 3.1: Quick Start
We provide a small demo dataset here. It contains two main files in CSV format: one stores the bulk mixture data and the other stores the external reference data. Pass the file paths to the function "ARIC" and the program will do everything else.
```python
from ARIC import *

# Keyword names follow the signature documented in Section 3.2.
ARIC(mix_path="mix.csv", ref_path="ref.csv")
```
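The following sketch writes two tiny CSV files with row names and column names, which is the shape of input the tutorial describes. The orientation (markers as rows, samples or cell types as columns), the helper `write_matrix`, and all values are assumptions made for illustration, not taken from the demo data.

```python
import csv
import os
import tempfile

def write_matrix(path, col_names, rows):
    """Write a matrix as CSV with a header row and a leading row-name column."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([""] + col_names)          # column names
        for row_name, values in rows.items():
            writer.writerow([row_name] + values)   # row name, then values

workdir = tempfile.mkdtemp()
mix_file = os.path.join(workdir, "mix.csv")
ref_file = os.path.join(workdir, "ref.csv")

# Hypothetical values: three markers, two bulk samples, two cell types.
write_matrix(mix_file, ["sample1", "sample2"],
             {"GENE_A": [5.5, 2.8], "GENE_B": [5.0, 7.4], "GENE_C": [3.0, 3.0]})
write_matrix(ref_file, ["T_cell", "B_cell"],
             {"GENE_A": [10.0, 1.0], "GENE_B": [1.0, 9.0], "GENE_C": [3.0, 3.0]})

with open(mix_file, newline="") as handle:
    print(list(csv.reader(handle)))
```

The two resulting paths could then be passed as the mixture and reference arguments of `ARIC`.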
Section 3.2: Function Introduction
The main function in ARIC is "ARIC".
```python
ARIC(mix_path, ref_path, save_path=None, marker_path=None,
     selected_marker=False, scale=0.1, delcol_factor=10,
     iter_num=10, confidence=0.75, w_thresh=10,
     unknown=False, is_methylation=False)
```
- 'mix_path': Path to the mixture data. Must be a CSV file with column names and row names.
- 'ref_path': Path to the reference data. Must be a CSV file with column names and row names.
- 'save_path': Where to save the deconvolution results. Default: mixpathprefix_prop.csv.
- 'marker_path': Path to the user-specified markers. Must be a CSV file.
- 'selected_marker': Whether to output the selected markers for every sample. Marker files will be saved in a folder named "sample_marker.csv". Default: False.
- 'scale': Controls the convergence of the SVR. A smaller value makes convergence much faster. Default: 0.1.
- 'delcol_factor': Controls the extent of collinearity removal. Default: 10.
- 'iter_num': Number of outlier detection iterations. Default: 10.
- 'confidence': Fraction of markers retained in each outlier detection loop. Default: 0.75.
- 'w_thresh': Threshold to cut the weights designer. Default: 10.
- 'unknown': Whether to estimate the proportion of unknown content. Default: False.
- 'is_methylation': Whether the input is methylation data. If true, a preliminary marker selection will be performed. Default: False.
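One way to picture how `iter_num` and `confidence` interact is an iterative trimming loop: fit, rank markers by residual, keep a fixed fraction, and repeat. The sketch below is an assumption-laden illustration, not ARIC's implementation: it swaps ordinary least squares on two cell types in for the weighted ν-SVR, and the helper names, the `min_markers` guard, and the data are hypothetical.

```python
def fit_two_components(mix, ref):
    """Ordinary least squares for mix ~ p1 * ref[:, 0] + p2 * ref[:, 1]."""
    a11 = sum(r[0] * r[0] for r in ref)
    a12 = sum(r[0] * r[1] for r in ref)
    a22 = sum(r[1] * r[1] for r in ref)
    b1 = sum(r[0] * m for r, m in zip(ref, mix))
    b2 = sum(r[1] * m for r, m in zip(ref, mix))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def trim_outlier_markers(mix, ref, iter_num=10, confidence=0.75, min_markers=3):
    """Repeatedly refit and keep the `confidence` fraction of best-fitting markers."""
    keep = list(range(len(mix)))
    for _ in range(iter_num):
        n_keep = int(len(keep) * confidence)
        if n_keep < min_markers:
            break  # hypothetical guard so the fit stays determined
        p1, p2 = fit_two_components([mix[i] for i in keep], [ref[i] for i in keep])
        # Rank retained markers by absolute residual, smallest first.
        ranked = sorted(keep, key=lambda i: abs(mix[i] - p1 * ref[i][0] - p2 * ref[i][1]))
        keep = sorted(ranked[:n_keep])
    return keep, fit_two_components([mix[i] for i in keep], [ref[i] for i in keep])

# Hypothetical reference (8 markers x 2 cell types) and a mixture with true
# proportions (0.3, 0.7) in which marker 0 has been corrupted.
ref_demo = [(10, 1), (1, 9), (8, 2), (2, 7), (5, 5), (9, 3), (3, 8), (6, 4)]
mix_demo = [0.3 * a + 0.7 * b for a, b in ref_demo]
mix_demo[0] += 5.0

kept, props = trim_outlier_markers(mix_demo, ref_demo)
print(kept, props)
```

In this toy case the corrupted marker is dropped in the first round, after which the fit recovers the true proportions.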
Section 4: Applications on TCGA Ovarian Cancer
In this part, we demonstrate how to use ARIC to classify ovarian cancer patients. Users can follow the instructions below to reproduce the results in our article.
Ovarian cancer patient data with survival information can be downloaded directly from LinkedOmics. The LM22 reference data can be downloaded from CIBERSORT. The survival information is stored in the file "Human__TCGA_OV__MS__Clinical__Clinical__01_28_2016__BI__Clinical__Firehose.tsi".
We provide the scaled data and survival information here.
Section 4.1: Deconvolution for All Patients
First, put "mix_scaled.csv" and "ref_scaled.csv" in your working folder.
```Python
from ARIC import *

# Keyword names follow the signature documented in Section 3.2.
ARIC(mix_path="mix_scaled.csv", ref_path="ref_scaled.csv", save_path="ov_ARIC.csv", selected_marker=True)
```
Then, wait for the deconvolution to finish.
```text
--------------WELCOME TO ARIC----------------
Data reading finished!
ARIC Engines Start, Please Wait......
100%|█████████████████████████████████████████████████████████████| 514/514 [01:14<00:00, 6.89it/s]
Deconvo Results Saving!
Finished!
```
There are two main outputs. The first is the estimated proportion file, named "ov_ARIC.csv". The second is a folder named "mix_scaled" (the same name as the input mixture file), which stores the markers selected by ARIC for each sample.
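A proportion file of this kind can be inspected with the standard library alone. The sketch below assumes that rows are cell types and columns are samples, which is how the R script in Section 4.2 indexes the file; the in-memory CSV and the helper `read_proportions` are hypothetical stand-ins, not real ARIC output.

```python
import csv
import io

# Hypothetical stand-in for a proportion file: rows are cell types, columns are samples.
example = io.StringIO(
    ",patient1,patient2\n"
    "T.cells.CD8,0.20,0.05\n"
    "Macrophages.M1,0.10,0.15\n"
    "B.cells.naive,0.70,0.80\n"
)

def read_proportions(handle):
    """Return {sample: {cell_type: proportion}} from a CSV with row and column names."""
    reader = csv.reader(handle)
    samples = next(reader)[1:]          # skip the empty corner cell
    table = {sample: {} for sample in samples}
    for row in reader:
        cell_type, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            table[sample][cell_type] = float(value)
    return table

proportions = read_proportions(example)
for sample, by_type in proportions.items():
    dominant = max(by_type, key=by_type.get)
    print(sample, dominant, by_type[dominant])
```

To read a file on disk instead, open it with `open(path, newline="")` and pass the handle to the same function.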
Section 4.2: Survival Analysis
Then, we perform survival analysis using the R packages "survival" and "survminer".
```R
library(survival)
library(survminer)
library(tidyr)
library(gridExtra)

# import survival information
sur_info <- read.table(file = "Human__TCGA_OV__MS__Clinical__Clinical__01_28_2016__BI__Clinical__Firehose.tsi", header = TRUE, row.names = 1)
tmprowname <- rownames(sur_info)

# estimated proportions: cell types in rows, patients in columns
data <- read.csv(file = "ov_ARIC.csv", header = TRUE, row.names = 1)

selected_celltype <- c("T.cells.CD8", "T.cells.gamma.delta", "Macrophages.M1", "NK.cells.resting", "NK.cells.activated")

# sum the selected cell types per patient and split at the median
data <- data[selected_celltype, ]
data <- colSums(x = data)
propmedian <- median(data)

highrisk <- names(data)[which(data <= propmedian)]
lowrisk <- names(data)[which(data > propmedian)]

label <- rep(x = "tumor", times = ncol(sur_info))
names(label) <- colnames(sur_info)
idxhigh <- which(names(label) %in% highrisk)
label[idxhigh] <- "high"
idxlow <- which(names(label) %in% lowrisk)
label[idxlow] <- "low"

# append the risk label as a new row, then keep only deconvolved patients
sur_info <- rbind(sur_info, label)
rownames(sur_info) <- c(tmprowname, "risk")

sur_info <- as.data.frame(t(sur_info[, which(colnames(sur_info) %in% names(data))]))

sur_info <- drop_na(data = sur_info, c("overall_survival", "status"))
sur_info <- transform(sur_info, overall_survival = as.numeric(overall_survival))
sur_info <- transform(sur_info, status = as.numeric(status))

fit <- survfit(Surv(overall_survival, status) ~ risk, data = sur_info)
ggsurvplot(fit, pval = TRUE, conf.int = TRUE)

res_cox <- coxph(Surv(overall_survival, status) ~ risk, data = sur_info)
summary(res_cox)$conf.int
```
Then, we can get the survival curve and the hazard ratio shown below.

| | exp(coef) | exp(-coef) | lower .95 | upper .95 |
|---|---|---|---|---|
| risklow | 0.7424766 | 1.346844 | 0.593249 | 0.9292413 |
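The grouping step of the R script (sum the selected cell types per patient, then split at the median) can be sketched in Python as follows. The helper `risk_labels` and the four demo patients are hypothetical; only the cell type names and the "at or below the median is high risk" rule come from the script.

```python
from statistics import median

selected_celltype = ["T.cells.CD8", "T.cells.gamma.delta", "Macrophages.M1",
                     "NK.cells.resting", "NK.cells.activated"]

def risk_labels(proportions, cell_types):
    """Label each sample 'high' or 'low' by its summed proportion vs. the median."""
    totals = {sample: sum(by_type.get(ct, 0.0) for ct in cell_types)
              for sample, by_type in proportions.items()}
    cutoff = median(totals.values())
    # At or below the median is 'high' risk, mirroring `data <= propmedian` in the R script.
    return {s: ("high" if t <= cutoff else "low") for s, t in totals.items()}

# Hypothetical proportions for four patients (only two of the cell types shown).
demo = {
    "p1": {"T.cells.CD8": 0.05, "Macrophages.M1": 0.02},
    "p2": {"T.cells.CD8": 0.20, "Macrophages.M1": 0.10},
    "p3": {"T.cells.CD8": 0.10, "Macrophages.M1": 0.05},
    "p4": {"T.cells.CD8": 0.30, "Macrophages.M1": 0.15},
}
labels = risk_labels(demo, selected_celltype)
print(labels)
```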
Section 5: Computational Efficiency Comparison
Computational efficiency is largely influenced by the number of markers. Therefore, we compared computation time across both different methods and different marker numbers.
We generated in silico mixed gene expression data with different marker numbers (100, 500, 1000, 2000, 5000, 7000 and 10000). To obtain reliable results, we generated 10 datasets for each marker number, each containing 50 samples. We compared the mean computation time for 50 samples and summarized the results in the following table.
ARIC needs to recompute the component-wise condition number after removing each collinear marker. Its computation time is therefore longer than that of matrix operation-based methods such as dtangle and DeconRNASeq. The computational efficiencies of ARIC, EPIC and FARDEEP are at the same level. In addition, the computation time of CIBERSORT grows drastically as the number of markers increases. We therefore strongly recommend filtering out low-quality markers before deconvolution.
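A benchmark of this kind can be sketched as below: simulate mixtures from a reference matrix, then time a method over them. Everything here is an assumption for illustration, including the noise model, the per-sample averaging, and `placeholder_method`, which merely stands in for a real deconvolution call and is not any of the methods compared above.

```python
import random
import time

def simulate_mixture(ref, rng, noise_sd=0.05):
    """Mix reference profiles with random proportions that sum to 1, plus Gaussian noise."""
    n_types = len(ref[0])
    raw = [rng.random() for _ in range(n_types)]
    total = sum(raw)
    props = [x / total for x in raw]
    mix = [sum(p * v for p, v in zip(props, row)) + rng.gauss(0.0, noise_sd) for row in ref]
    return mix, props

def mean_runtime(method, ref, n_samples=50, seed=0):
    """Average wall-clock seconds that `method(mix, ref)` takes per simulated sample."""
    rng = random.Random(seed)
    mixtures = [simulate_mixture(ref, rng)[0] for _ in range(n_samples)]
    start = time.perf_counter()
    for mix in mixtures:
        method(mix, ref)
    return (time.perf_counter() - start) / n_samples

def placeholder_method(mix, ref):
    """Hypothetical stand-in for a deconvolution call; it only touches every value."""
    return sum(m * sum(row) for m, row in zip(mix, ref))

rng = random.Random(1)
for n_markers in (100, 500, 1000):
    # Hypothetical reference: n_markers rows, five cell types.
    ref = [[rng.random() * 10 for _ in range(5)] for _ in range(n_markers)]
    print(n_markers, mean_runtime(placeholder_method, ref, n_samples=10))
```

Replacing `placeholder_method` with a real deconvolution function would give timings that can be compared across marker numbers.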
Citation
Zhang, Wei, Hanwen Xu, Rong Qiao, Bixi Zhong, Xianglin Zhang, Jin Gu, Xuegong Zhang, Lei Wei, and Xiaowo Wang. "ARIC: accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data." Briefings in Bioinformatics 23, no. 1 (2022): bbab362.
Owner
- Name: XWangLabTHU
- Login: XWangLabTHU
- Kind: organization
- Repositories: 4
- Profile: https://github.com/XWangLabTHU
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| ZweiTHU | s****2@g****m | 16 |
| Honchkrow | zw@s****n | 2 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: 3 days
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 1.5
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ryrl9703 (2)
Packages
- Total packages: 1
- Total downloads: 10 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 4
- Total maintainers: 1
pypi.org: aric
ARIC: Accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data
- Homepage: https://xwanglabthu.github.io/ARIC/
- Documentation: https://aric.readthedocs.io/
- License: GPL V3
- Latest release: 1.0.1 (published almost 2 years ago)
Dependencies
- numpy *