ghmcn

Graphical 5hmC Convolutional Networks

https://github.com/ay-lab/ghmcn

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, biorxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.1%) to scientific vocabulary

Repository

Graphical 5hmC Convolutional Networks

Basic Info
  • Host: GitHub
  • Owner: ay-lab
  • License: other
  • Language: Python
  • Default Branch: master
  • Size: 3.2 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Created about 3 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation Authors Zenodo

README.md

Graph 5hmC Convolutional Network (GhmCN)

To Dos

  • ~~Bash script to integrate RNA-seq into rnaseq.csv registry~~
  • ~~Child issue: add Rscript that TPM-normalizes~~
  • ~~Add example data from featureCounts~~
  • ~~Add example code to perform RNA-seq integration~~
  • ~~Add code that transforms `BAM` enrichment files into binned signal of interest~~
  • ~~Child issue: Rscript that generates the ROIs from given Hi-C data~~
  • ~~Add script and explanation of how to transform ice-normalized Hi-C data into accessible data for the script~~
  • ~~Child issue: all of the required scripts~~
  • ~~Add note to install the following required `R packages` if using the auxiliary functions.~~
  • ~~Make the R libraries part of the conda installation.~~

TOC

Intro

GhmCN uses the graph structure from GC-merge, a tool that integrates enrichment signals with spatial genomic information to predict gene expression. Our implementation is described in detail in our bioRxiv preprint. If you use our implementation, please cite us, and kindly cite the work of Bigness et al. as well.
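For intuition, the core operation of such a graph convolutional network, propagating per-bin features over the Hi-C contact graph, can be sketched in plain numpy. This is an illustrative simplification with toy shapes, not our PyTorch Geometric implementation; `gcn_layer` and all variables here are hypothetical:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution step: symmetrically normalize the
    adjacency matrix (with self-loops) and propagate node features.
    A: (n, n) contact/adjacency matrix, X: (n, f) node features,
    W: (f, h) learnable weights."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0)  # ReLU

# Toy example: 3 genomic bins, 2 features (e.g. IP and INPUT signal), 4 hidden units
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.random.rand(3, 2)
W = np.random.rand(2, 4)
H = gcn_layer(A, X, W)
print(H.shape)  # (3, 4)
```

Each output row mixes a bin's own features with those of the bins it contacts, which is how the spatial genomic information enters the prediction.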

In this repository we provide our conda environment, six main programs (two with core classes and four for data processing), and utility functions to pre-process the required inputs: genomic contacts, gene expression, and genomic enrichment profiles. In our Zenodo deposit we included a tarball with example raw data to illustrate how to process it for use by our code. We also included two processed example datasets.

The required input for our code can be obtained from (but is not limited to) the output of Hi-C-Pro, enrichment signal from BAM files, and the output of featureCounts. All of these processed files are required to make use of our GCN, but you can also derive your own input files following the appropriate formatting.

1. Conda Environment

We used conda and a CUDA-able (NVIDIA GPU) environment for our work. Please use the description below to replicate our working environment.

```
# Generate a new environment:
conda create -n GhmCN python==3.7.9
conda activate GhmCN

# We tested our code with pytorch<=1.12.0, which we suggest installing:
conda install pytorch=1.12.0 torchvision -c pytorch

# Check your CUDA version:
/usr/local/cuda/bin/nvcc --version

# Use the appropriate CUDA/torch versions and install:
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.12.0+cu102.html
pip install torch-geometric

# Other required programs:
conda install numpy scipy pandas matplotlib
conda install -c conda-forge ordered-set
conda install -c anaconda scikit-learn
```

1.1 R packages used

  • GenomicRanges
  • GenomicAlignments
  • stringr

NOTE: These are included in the yaml file.

2. Zenodo deposit

To help reproduce our results and speed up research, we uploaded both raw and processed files to this Zenodo deposit. These illustrate the functionality of all the pieces of code included in our repository. We included two tarballs containing raw and processed data, respectively. They are not required to run our code, but there are some hardcoded structures that you may want to change if you don't want to use the preset output directories. The relevance of each tarball file is described below.

Before testing code

Download and decompress B72_and_Naive_CD4T_examples.tar.gz under src/data/ to test the code with our processed datasets.

This tarball contains two example datasets, one in each folder:

  • B72: B cells stimulated with LPS and IL4 for 72h
  • Naive_CD4T: Naïve CD4 Single Positive T cells

Before testing utilities functions

Download and decompress the example_raw_data.tar.gz tarball under the example/ folder. It contains the raw_data folder, whose subfolders hold the data used in the example code snippets below.

This tarball contains three subfolders under raw_data, with the following contents:

  • hic/ — outputs from HiC-Pro: _abs.bed, _iced.matrix
  • cms/ — mapping results of CMS IP and INPUT: _IP.bam, _INPUT.bam
  • rna/ — outputs from featureCounts: _featureCounts.txt

3. Data Preparation

In this section we describe the characteristics of the required input files and provide a guide for data preparation from HiC-Pro, BAM, and featureCounts outputs. Under this Zenodo deposit we provide an example dataset (example_raw_data.tar.gz) for Naïve B cells (codename B00, as used in our publication). Check the Zenodo Deposit section for details.

Beware that the uncompressed size of this archive is *73G*.

3.1. Hi-C

The code requires the DNA interaction maps to be in a specific format: divided into per-chromosome files (e.g. hic_chr1.txt, ..., hic_chrN.txt), with the second column as the leading coordinate.

Example input from our Naive_CD4T's hic_chr1.txt:

```
3000000   3000000   49.9945577453461
3010000   3000000   51.25039943757
3020000   3000000   24.852610473894
...
195360000 195360000 35.0175818484145
195370000 195360000 82.6174691895984
195370000 195370000 77.7522404318486
```

The ice-normalized output from Hi-C-Pro requires heavy reformatting to achieve this simpler structure. Ice-normalized data are 3-row files with thousands of columns, where the coordinates (1st and 2nd rows) are written in scientific notation and the third row contains the normalized contact information. We added a set of auxiliary scripts to ease the reformatting from 3xN to Nx3 matrices. An alternative was to load the matrix as a whole, transpose it, and store it back, but we hit memory limitations in multiple attempts. These auxiliary functions/scripts make use of the Perl and R languages. Before proceeding with the example below, decompress example_raw_data.tar.gz from our Zenodo deposit inside GhmCN/example.

```
ice_normalized=./example/raw_data/hic/B00_10000_iced.matrix
reference_matrix=./example/raw_data/hic/B00_10000_abs.bed
celltype_name=B00

# Run our auxiliary function as
./utils/RiceC.sh $ice_normalized $reference_matrix $celltype_name
```

In general terms, what this function does step-wise is:

  • Divide the iced data per row into three individual files.
  • Further divide these into individual subfiles (each with 1000000 coordinates) for easier processing.
  • Run a self-made R script to sequentially load these subfiles and eliminate scientific notation.
  • Merge the cleaned files.
  • Paste the columns into a single file.
  • Run Perl's `HiC_Pro2Readable.pl` to generate the reformatted file from the parsed data.
  • Separate by chromosome and store the files under `src/data/CellType`.
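Our auxiliary pipeline uses Perl and R, but the essence of the 3xN to Nx3 reshaping, pairing the two coordinate rows with the contact row and stripping scientific notation, can be sketched in Python. This is a simplified illustration only (it holds whole rows in memory, which is exactly what we avoided for full-size matrices); `reformat_iced` is a hypothetical name:

```python
def reformat_iced(lines):
    """Turn a 3-row ice-normalized matrix (rows: coordinate 1,
    coordinate 2, normalized contacts; coordinates possibly in
    scientific notation) into N rows of 'coord1 coord2 value'."""
    coord1, coord2, values = (line.split() for line in lines)
    out = []
    for a, b, v in zip(coord1, coord2, values):
        # int(float(...)) strips scientific notation like 3e+06
        out.append(f"{int(float(a))}\t{int(float(b))}\t{v}")
    return out

rows = reformat_iced([
    "3e+06 3.01e+06",
    "3e+06 3e+06",
    "49.9945577453461 51.25039943757",
])
print(rows[0])  # 3000000	3000000	49.9945577453461
```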

3.2. featureCounts

This section covers the case where the user obtained count signal per gene using featureCounts for gene expression assessment. In this case the output file has 7+ columns, of which the 1st, 6th, and 7th columns are extracted for gene expression. The AddCellExpression.sh shell script runs the FeatureCounts2TPM.R Rscript to TPM-normalize the raw featureCounts counts. These normalized counts are integrated into the expression file src/data/rnaseq.csv, used by the main scripts.

Example input from our Naive_CD4T's expression data inside the src/data/rnaseq.csv file:

```
gene_id,...,Naive_CD4T,...
Xkr4,...,0,...
Rp1,...,0,...
Sox17,...,0,...
...
Gm20806,...,0,...
Gm20854,...,0,...
Erdr1,...,2.898,...
```
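The TPM normalization applied by FeatureCounts2TPM.R follows the standard formula: divide counts by gene length in kilobases to get per-kilobase rates, then rescale the rates so they sum to one million. A minimal Python equivalent, with hypothetical gene lengths and counts:

```python
def tpm(counts, lengths_bp):
    """Transcripts Per Million: divide counts by gene length (kb),
    then scale so the resulting column sums to 1e6."""
    rates = [c / (l / 1000.0) for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# featureCounts columns 1, 6, 7: gene_id, Length, counts (toy values)
genes   = ["Xkr4", "Rp1", "Erdr1"]
lengths = [3634, 9747, 2215]
counts  = [0, 0, 120]
print(dict(zip(genes, tpm(counts, lengths))))
```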

Having the featureCounts output, integrating expression data into the main expression file is as easy as running our auxiliary script:

```
celltype_name=B00
expression_file_path=./example/raw_data/rna/B00_featureCounts.txt

# Run our auxiliary function as
./utils/AddCellExpression.sh $celltype_name $expression_file_path
```

**NOTE**: if the `celltype_name` is already present in the `src/data/rnaseq.csv` file, it will not be overwritten nor duplicated. Running this script will end by notifying the user about the presence of such a cell type.

3.3. BAM Enrichment files

Pre-formatting the Hi-C data is a requirement for obtaining the 5hmC (or any other) enrichment signal per bin, since the scripts below make use of the hic_chr1.txt, ..., hic_chr19.txt coordinates to generate the regions of interest (ROIs) from which the enrichment signal is extracted.

1. To collect the cell-specific signal for a set of genomic interactions, we first need to delimit the ROIs.
   - This step makes use of the script Generate_ROIs.sh, which takes the start and end of coverage per chromosome for a given cell type.

```
celltype_name=B00

# Run our auxiliary function as
./utils/Generate_ROIs.sh $celltype_name
```

   - This will generate the appropriate `Rdata` file within the corresponding `celltype_name` folder under `src/data`, to be used by other helper functions.
   - **IMPORTANT:** Make sure you consistently use the same `celltype_name` across the `Hi-C` and `rnaseq.csv` processing steps.
2. Now we can pull the enrichment signal out of these cell-specific ROIs.
   - We make use of the Rscript `Collect_Count_Signal_CellSpecific.R`, which takes 4 positional arguments:
     - `BAM` file path
     - Enrichment condition name (CMS or INPUT in our case; yours may vary if not using 5hmC, e.g. `T53ChIP`)
     - `celltype_name`. **IMPORTANT:** Has to match a `src/data/` subfolder.
     - `OUTDIR`: **this must be `src/data/celltype_name`**. I did not find a reliable way within an `Rscript` to find its own script location (even asking [`chatGPT`](https://chat.openai.com/chat) ;-) ) to automate this.

```
bam=./example/raw_data/cms/B00_CMS_IP.bam
condition=CMS
celltype_name=B00
outdir=./src/data/${celltype_name}

# Run our auxiliary function as
./utils/Collect_Count_Signal_CellSpecific.R $bam $condition $celltype_name $outdir
```

  • This will populate the `celltype_name` data folder with the required signal for the mark "CMS".
  • Repeat for all of the different marks you intend to analyze/integrate (e.g. with the INPUT enrichment we provided).

Example input from ./src/data/Naive_CD4T/chr1_10000bp_CMSIP.count, our Naive_CD4T's 5hmC enrichment file:

```
chr1 3000000   3010000   250
chr1 3010000   3020000   225
chr1 3020000   3030000   222
...
chr1 195340000 195350000 236
chr1 195350000 195360000 44
chr1 195360000 195370000 202
```
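Conceptually, producing such a .count file amounts to assigning each read's position to a fixed-width bin and counting. A simplified Python sketch (made-up read positions, fixed 10000 bp bins, ignoring BAM parsing and paired-end details):

```python
from collections import Counter

def bin_counts(read_starts, bin_size=10_000):
    """Count reads per fixed-width genomic bin; each bin is keyed
    by its start coordinate (position floored to a bin multiple)."""
    return Counter((pos // bin_size) * bin_size for pos in read_starts)

# Hypothetical read start positions on chr1
reads = [3_000_050, 3_004_210, 3_012_000, 3_019_999]
counts = bin_counts(reads)
print(counts[3_000_000], counts[3_010_000])  # 2 2
```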

4. Running the Two Main Codes

First run process_inputs_EGA.py on the command line from the src directory. After completion you are ready to train your network using run_models_EGA.py. For instance:

```
cd ./src
python process_inputs_EGA.py -c B00 -rf 0

# wait (est. 30-40 mins)

python run_models_EGA.py -c B00 -rf 0
```

This will run the model for the B00 cell type using the classification task. The inputs to these flags can be changed so that the model can run for different cell lines (-c) as well as for either classification (-rf 0) or regression (-rf 1). Check inside each script for additional flag options.
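To illustrate the difference between the two tasks: a classification target is typically a binarized expression label (e.g. above/below a threshold such as the median), while a regression target is a continuous, often log-transformed, expression value. The snippet below is only a sketch of that idea; the actual preprocessing lives in process_inputs_EGA.py and may differ:

```python
import math
import statistics

def make_targets(tpm_values, task="classification"):
    """Illustrative target construction: binarize at the median for
    classification, log-transform for regression."""
    if task == "classification":
        med = statistics.median(tpm_values)
        return [1 if v > med else 0 for v in tpm_values]
    return [math.log2(v + 1) for v in tpm_values]

tpms = [0.0, 2.898, 15.2, 0.4]
print(make_targets(tpms))                  # [0, 1, 1, 0]
print(make_targets(tpms, "regression"))
```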

5. Running GNNExplainer (Visualizing Results)

After the models have been trained and the scores reported, you can run our expanded implementation of GNNExplainer. You can select a specific node (-n) you want explained, or provide a gene name (-gn) of interest; genes located in chrX, chrY, and chrM are excluded.

```
python ./src/GNNExplainNode.py \
  --UsePredExp \
  -od ./src/data/Nodes_Explained/Top10 \
  -ep 1500 \
  -rf 0 \
  -hm 2 \
  -cr 10000 \
  -df data \
  --DrawWeight \
  --DrawCoords \
  -mt ./src/data/Naive_CD4T_CMS_10000bp/saved_runs/model_2021-11-29-at-21-23-15.pt \
  -mo Naive_CD4T \
  -c Naive_CD4T
```

Citing our work

TBU -- current state is journal pre-release.

Owner

  • Login: ay-lab
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Graph 5hmC Convolutional Network"
version: 1.0.0
date-released: 2024-05

authors:
  - family-names: "González-Avalos"
    given-names: "Edahí"
    orcid: "https://orcid.org/0000-0002-6817-4854"
    affiliation: "University of California San Diego; La Jolla Institute for Immunology"
    email: "edahi@lji.org"
  - family-names: "Ay"
    given-names: "Ferhat"
    orcid: "https://orcid.org/0000-0002-0708-6914"
    affiliation: "La Jolla Institute for Immunology; University of California San Diego"
    email: "ferhatay@lji.org"

references:
  - type: miscellaneous
    name: "Software Testing by Dante Bolzan"
    details: "Dante tested the repository ensuring reproducibility using provided example data in Zenodo."
