magical

R and MATLAB scripts for MAGICAL

https://github.com/xichensf/magical

Keywords

bayesian-inference gene-regulatory-network non-commercial-license single-cell-multiomics

Last synced: 10 months ago · JSON representation ·

Repository

R and MATLAB scripts for MAGICAL

Basic Info

Host: GitHub
Owner: xichensf
License: other
Language: R
Default Branch: main
Homepage:
Size: 4.03 MB

Statistics

Stars: 10
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 1

Topics

bayesian-inference gene-regulatory-network non-commercial-license single-cell-multiomics

Created over 3 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

MAGICAL analysis

MAGICAL (Multiome Accessible Gene Integration Calling And Looping) is a hierarchical Bayesian approach that leverages paired scRNA-seq and scATAC-seq data from different conditions to map disease-associated transcription factors, chromatin sites, and genes as regulatory circuits. By simultaneously modeling signal variation across cells and conditions in both omics data types, MAGICAL achieved high accuracy on circuit inference.

R and MATLAB scripts are provided for the use of MAGICAL with R v4.0.0 or MATLAB 2020a or later. The complete single cell data files are over 100GB and also the Bayesian learning process in MAGICAL may take hours to run on a local machine. Before working on a complete dataset, we suggest users to download our demo dataset to get familiar with MAGICAL functions (running time ~15 mins).

alt text

MAGICAL input

MAGICAL only requires gene symbols, peak coordinates, read count and cell meta information (including cell type, sample/subject ID and sample group/condition). These information are very fundamental and can be easily obtained from any single cell multioimc dataset. We provide a script Multiomicsinputfor_MAGICAL.R to show how to prepare the necessary input files from the single cell multiomics data for use with MAGICAL. The script includes code to extract needed information from Seurat processed scRNA-seq data and ArchR or Signac processed scATAC-seq data.

Cell type

MAGICAL infers regulatory circuits for each cell type. Therefore, the input scRNA-seq and scATAC-seq data should be cell type labelled. Users would need to select one cell type and use the provided R script to prepare the following input files for MAGICAL.

Candidate genes (DEG) and chromatin sites (DAS)

As differential calling is usually done seperately during the scRNA-seq and scATAC-seq data processing, we highly recommand preparing these two files using similar differential statistics cutoffs.

Candidate gene file: a list of gene symbols
Candidate chromatin site file: a three-column matrix of chr, point1, and point2

```diff source('MAGICAL_functions.R')

library(Matrix) library(dplyr)

pre-selected candidate genes and peaks for the cell type

Candidategenefilepath = 'Demo input files/Cell type candidate genes.txt' Candidatepeakfilepath = 'Demo input files/Cell type candidate peaks.txt' ```

scRNA-seq read count

The scRNA-seq read count information from cells labelled to the selected cell type.

scRNA read count file: a three-column matrix with gene index, cell index, and RNA read count
scRNA gene name file: a two-column matrix with gene index and gene name.
scRNA cell meta file: a five-column matrix with cell index, cell barcode, cell type label, sample/subject ID, and condition

Note, each sample must have a unique name and this name should be the same in the scATAC-seq data to allow MAGICAL to pair the two modalities together.

```diff

filtered scRNA data of the cell type

scRNAreadcountfilepath = 'Demo input files/Cell type scRNA read count.txt' scRNAgenefilepath = 'Demo input files/scRNA genes.txt' scRNAcellmetafile_path = 'Demo input files/Cell type scRNA cell meta.txt' ```

scATAC-seq read count

The scATAC-seq read count information from cells labelled to the selected cell type.

scATAC read count file: a three-column matrix with peak index, cell index, and ATAC read count
scATAC peak coordinate file: a four-column matrix with peak index, chr, peak_point1, peak_point2.
scATAC cell meta file: a five-column matrix with cell index, cell barcode, cell type label, sample/subject ID, and condition

```diff

filtered scATAC data of the cell type

scATACreadcountfilepath = 'Demo input files/Cell type scATAC read count.txt' scATACpeakfilepath = 'Demo input files/scATAC peaks.txt' scATACcellmetafile_path = 'Demo input files/Cell type scATAC cell meta.txt' ```

TF motif mapping (prior)

We map 870 motifs from the chromVARmotifs library to all peaks and get the binary mapping relationship.

Motif mapping file: a three-column matrix with peak index, motif index, and binary binding.
Motif name file: a two-column matrix with motif index and motif names.

Note, the motif names of the same TF can be very different if they are collected from different databases. To avoid duplicated motifs and minimize redundance, we required using the corresponding gene name for each TF motif.

```diff

TF motif prior on all ATAC peaks

Motifmappingfilepath = 'Demo input files/Motif mapping prior.txt' Motifnamefilepath = 'Demo input files/Motifs.txt' ```

TAD boundary (prior)

TAD boundary information can be easily obatined from HiC profiles or similar experiments. We include in our demo a GM12878 HiC-based TAD file including ~6000 domains with median size 400kb for blood context analysis. * TAD file: a three column matrix with chr, left_boundary, and right_boundary

In case no proper TAD information or HiC profile is available for the context being studied, another option to use relative distance to TSS (e.g. 500kb) as prior to initally pair peaks and genes. A hg38 RefSeq file is included in the demo for TSS reference.

```diff

TAD prior

TADfilepath = 'Demo input files/RaoGM1287840kbTopDomTADsfilteredhg38.txt' distance_control=5e5

Refseq file for transcription starting site extraction

Refseqfilepath = 'Demo input files/hg38Refseq.txt' ```

Load data:

``` loadeddata <- Dataloading(Candidategenefilepath, Candidatepeakfilepath, scRNAreadcountfilepath, scRNAgenefilepath, scRNAcellmetafilepath, scATACreadcountfilepath, scATACpeakfilepath, scATACcellmetafilepath, Motifmappingfilepath, Motifnamefilepath, Refseqfile_path)

loading all input data ...

We detected 2 conditions from the meta file.

The input scRNAseq data includes 36601 genes, 13248 cells from 12 samples/subecjts.

The input scATACseq data includes 144387 peaks, 13248 cells from 12 samples/subecjts.

There are paired data for 12 samples/subecjts. (check sample IDs if this number is lower than expected)

870 motifs, 945 candidate chromatin sites and 1143 candidate genes are provided.

```

MAGICAL circuit inference

Build candidate circuits

To identify candidate disease-modulated triads, candidate chromatin sites are associated with TFs by motif sequence matching. These sites are then linked to the candidate genes by requiring them to be within the same TAD or within a user controlled distance.

```diff

Candidate circuits construction with TAD

Candidatecircuits <- CandidatecircuitsconstructionwithTAD(loadeddata, TADfilepath) ```

Or

```diff

Candidate circuits construction without TAD

Candidatecircuits <- CandidatecircuitsconstructionwithoutTAD(loadeddata, distance_control) ```

Then

```diff

Model initialization

Initialmodel<-MAGICALinitialization(loadeddata, Candidatecircuits) ```

Infer circuit linkages

MAGICAL uses a Bayesian framework to iteratively model chromatin accessibility and gene expression variation across cells and conditions and estimate the strength of the circuit TF-peak-gene linkages. The TF-peak binding confidence and the hidden TF activity are optimized to fit to the chromatin accessibility data. The estimated TF binding strength, TF activity and the input gene expression data are used to infer the peak-gene looping. The updated states of TF-peak-gene linkages based on the estimated strength are used to initialize the next round of estimations.

```diff

Model parameter estimation

source('R/MAGICAL_functions.R')

Circuitslinkageposterior<-MAGICALestimation(loadeddata, Candidatecircuits, Initialmodel, iteration_num = 1000)

MAGICAL finished 10 percent

MAGICAL finished 20 percent

MAGICAL finished 30 percent

MAGICAL finished 40 percent

MAGICAL finished 50 percent

MAGICAL finished 60 percent

MAGICAL finished 70 percent

MAGICAL finished 80 percent

MAGICAL finished 90 percent

MAGICAL finished 100 percent

```

MAGICAL output

Finally, optimized circuits fitting the variation in both modalities are selected. Circuit genes, assocaited chromatin sites and the regulatory TFs will be written to MAGICALselectedregulatory_circuits.txt. MAGICAL uses default thresholds (posterior probabilities on TF-peak binding and peak-gene looping) for circuit selection. Users can adjust these thresholds in the provided scripts to allow more or fewer outputs. Note, running MAGICAL on our demo files may return very similar but slighly different results to the example output file we provided due the nature of Bayesian algorithsm.

``` MAGICALcircuitsoutput(Outputfilepath = 'MAGICALselectedregulatorycircuits.txt', Candidatecircuits, Circuitslinkageposterior)

MAGICAL selected regulatory circuits with 90 TFs, 249 peaks and 225 genes.

```

Human PBMC single cell multiomics data

The sample-paired scRNA-seq and scATAC-seq data used in the MAGICAL paper have been deposited with the Gene Expression Omnibus under accession no. GSE220190.

Users can also download datasets used in the papaer using the links below: * The Seurat-intergated R object of S. aureus scRNA-seq data can be downloaded here (15GB). * The ArchR-integrated R project of S. aureus scATAC-seq data can be downloaded here (34GB, including arrow files). * The ArchR-integrated R project of COVID-19 scATAC-seq data can be downloaded here (7GB, including arrow files).

Reference

Xi Chen et al., Mapping disease regulatory circuits at cell-type resolution from single-cell multiomics data, Nature Computational Science, 3:644–657, (2023).

Contact

Questions regarding the single cell multiome data and MAGICAL framework can be emailed to Xi Chen (xchen@flatironinstitute.org) and Yuan Wang (yuanwang@cs.princeton.edu). Other questions about our work should be emailed to Olga G. Troyanskaya (ogt@genomics.princeton.edu) and Stuart C. Sealfon (stuart.sealfon@mssm.edu).

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite our publication as below."
authors:
  - family-names: Xi
    given-names: Chen
title: Mapping disease regulatory circuits at cell-type resolution from single-cell multiomics data
version: v1.1
date-released: 2023-05-19

magical

Science Score: 54.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

MAGICAL analysis

MAGICAL input

Cell type

Candidate genes (DEG) and chromatin sites (DAS)

pre-selected candidate genes and peaks for the cell type

scRNA-seq read count

filtered scRNA data of the cell type

scATAC-seq read count

filtered scATAC data of the cell type

TF motif mapping (prior)

TF motif prior on all ATAC peaks

TAD boundary (prior)

TAD prior

Refseq file for transcription starting site extraction

loading all input data ...

We detected 2 conditions from the meta file.

The input scRNAseq data includes 36601 genes, 13248 cells from 12 samples/subecjts.

The input scATACseq data includes 144387 peaks, 13248 cells from 12 samples/subecjts.

There are paired data for 12 samples/subecjts. (check sample IDs if this number is lower than expected)

870 motifs, 945 candidate chromatin sites and 1143 candidate genes are provided.

MAGICAL circuit inference

Build candidate circuits

Candidate circuits construction with TAD

Candidate circuits construction without TAD

Model initialization

Infer circuit linkages

Model parameter estimation

MAGICAL finished 10 percent

MAGICAL finished 20 percent

MAGICAL finished 30 percent

MAGICAL finished 40 percent

MAGICAL finished 50 percent

MAGICAL finished 60 percent

MAGICAL finished 70 percent

MAGICAL finished 80 percent

MAGICAL finished 90 percent

MAGICAL finished 100 percent

MAGICAL output

MAGICAL selected regulatory circuits with 90 TFs, 249 peaks and 225 genes.

Human PBMC single cell multiomics data

Reference

Contact

Citation (CITATION.cff)

GitHub Events

Total

Last Year