Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 2 DOI reference(s) in README -
✓Academic publication links
Links to: arxiv.org, biorxiv.org, zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (16.1%) to scientific vocabulary
Repository
Graphical 5hmC Convolutional Networks
Basic Info
- Host: GitHub
- Owner: ay-lab
- License: other
- Language: Python
- Default Branch: master
- Size: 3.2 MB
Statistics
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 3
Metadata Files
README.md
Graph 5hmC Convolutional Network (GhmCN)

To Dos
- ~~Bash script to integrate rna-seq into rnaseq.csv registry~~ - ~~child issue: add Rscript that TPM-normalizes~~ - ~~Add example data from feature counts~~ - ~~Add example code to perform rna-seq integration~~ - ~~Add code that transforms `BAM` enrichment files into binned signal of interest~~ - ~~Child issue: Rscript that generates the ROIs from given Hi-C data~~ - ~~Add script and explanaiton about how to transform ice-normalized Hi-C data into accessible data for script~~ - ~~child issue: all the bunch of required scripts~~ - ~~Add note to please add following required `R packages` if using the auxiluary functions.~~ - ~~Need to make the R libraries part of the conda installation.~~TOC
- Intro
- 1. Conda Environment
- 2. Zenodo
- 3. Data Preparation
- 4. Running the GCN
- 5. GNNExplainer Visualization
- Citation
Intro
GhmCN uses the graph structure from GC-merge, a tool able to integrate both enrichment signal with spatial genomic information to predict gene expression. Our implementation is broadly detailed in our bioRxiv. If you use our implementation cite us and please kindly cite the work of Bigness et al.
In this repository we provide our conda environment, six main programs (two with core classes and 4 for data processing), and utility functions to pre-process the required inputs: genomic contacts, gene expression and genomic enrichment profiles. In our Zenodo deposit we included a tarball with example raw data to illustrate how to process it to be used by our code. We also incldued two processed example datasets.
The required input for our code can be obtained (but not limited to) from the output of Hi-C-Pro, enrichment signal from BAM files and output of featureCounts. All of these processed files are required to make use of our GCN, but you can derive your own input files following the appropriate formatting.
1. Conda Environment
We used conda and a CUDA-able (NVIDIA GPU) environment for our work. Please use the description below to replicate our working environment.
```
Generate a new environment:
conda create -n GhmCN python==3.7.9 conda activate GhmCN
We tested our code with pytorch<=1.12.0
which we suggest to install
conda install pytorch=1.12.0 torchvision -c pytorch
Check your CUDA version:
/usr/local/cuda/bin/nvcc --version
Use appropriate CUDA/torch version and install
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.12.0+cu102.html pip install torch-geometric
Other required programs:
conda install numpy scipy pandas matplotlib conda install -c conda-forge ordered-set conda install -c anaconda scikit-learn ```
1.1 R packages used
- GenomicRanges
- GenomicAlignments
- stringr
NOTE Included in
yamlfile
2. Zenodo deposit
In order to help reproducing our results and aid in the speeding of research, we uploaded both raw and processed files into this Zenodo deposit. This will help illustrate the functionality of all the pieces of code we have included in our repository. We included two tarballs that contain raw and processed data respectively. They are not required to run our code but there are some hardcoded structures that you may want to change if don't want to use the pre-sets output directories. Below the relevance of each or the tarball files.
Before testing code
Download and decompress the B72_and_Naive_CD4T_examples.tar.gz folders under src/data/ to test the code with our processed datasets.
This tarball contains two example datasets, one in each folder:
- B72: B cells stimulated with LPS and IL4 for 72h
- Naive_CD4T: Naïve CD4 Single Positive T cells
Before testing utilities functions
Download and decompress the tarball example_raw_data.tar.gz file under the example/ folder. This tarball contains the raw_data folder with the subfolders that have the data used in the example code snippets below.
This tarball contains three subfolders under raw_data that contains the following:
- hic/ outputs from HiC-Pro
- _abs.bed
- _iced.matrix
- cms/ mapping results of CMS IP and INPUT
- _IP.bam
- _INPUT.bam
- rna/ outputs from using featureCounts
- _featureCounts.txt
3. Data Preparation
In this section we describe the characteristics of the required input files and a guide for data preparation from HiC-Pro, BAM and featureCounts outputs. Under this Zenodo deposit we provided an example dataset (example_raw_data.tar.gz) for Naïve B cells (B00 codename as used in our publication). Check Zenodo Deposit for details.
Beware that the uncompressed size of this archive is *73G*.
3.1. Hi-C
The code requires the DNA interaction maps to be of an specific format: divided in files by chromosomes (e.g. hic_chr1.txt, ..., hic_chrN.txt) and having the second column as the leading coordinate.
Example input from our Naive_CD4T's hic_chr1.txt
3000000 3000000 49.9945577453461
3010000 3000000 51.25039943757
3020000 3000000 24.852610473894
...
195360000 195360000 35.0175818484145
195370000 195360000 82.6174691895984
195370000 195370000 77.7522404318486
The ice-normalized output from Hi-C-Pro requires heavy reformatting to achieve this simpler structure. ice-normalized data are 3-row files with thousands of columns where the coordinate (1st and 2nd row) is written in scientific notation. The third row contains the normalized contact information. We added a set of auxiliary scripts to ease the reformatting from 3xN to Nx3 matrices. We understand that an option was to load the matrix as a whole, transpose it and store back but we hit memory limitations in multiple instances and tries. These auxiliary functions/scripts make use of Perl and R languages. Before proceeding with th eexample below, decompress example_raw_data.tar.gz from our Zenodo deposit inside GhmCN/example.
```
icenormalized=./example/rawdata/hic/B0010000iced.matrix
referencematrix=./example/rawdata/hic/B0010000abs.bed
celltype_name=B00
Run our auxiliary function as
./utils/RiceC.sh $icenormalized $referencematrix $celltypename
``
In general terms what this function does step-wise is:
- Divide iced data per row into three individual files.
- Furhter divide in individual files (each with 1000000 coordinates) for easier processing.
- Run a self-made R script to sequentially load these subfiles and eliminate scientific notation.
- Merge cleaned files
- Paste columns into single file.
- run Perl'sHiC_Pro2Readable.plto generate the reformatted file from the parsed data.
- Separate by chromosome and store them undersrc/data/CellType`
3.2. featureCounts
This section covers the case where the user obtained count signal per gene for gene expression assessment using featureCounts. This is the case where the output file has 7+ columns where the 1st, 6th and 7th column are extracted for gene expression. The AddCellExpression.sh shell script will run the FeatureCounts2TPM.R Rscript to TPM normalize the raw featureCounts counts. These normalized counts are integrated into the expression file under src/data/rnaseq.csv file, used by the main scripts.
Example input from our Naive_CD4T's expression data insisde the src/data/rnaseq.csv file
gene_id,...,Naive_CD4T,...
Xkr4,...,0,...
Rp1,...,0,...
Sox17,...,0,...
...
Gm20806,...,0,...
Gm20854,...,0,...
Erdr1,...,2.898,...
Having the featureCounts output, it is as easy as to run our auxiliary script as follows to integrate expression data into the main expression file:
```
celltypename=B00
expressionfilepath=./example/rawdata/rna/B00_featureCounts.txt
Run our auxiliary function as
./utils/AddCellExpression.sh $celltypename $expressionfilepath
``
**NOTE**: if thecelltypenameis already present in thesrc/data/rnaseq.csv` file, it will not be overwritten nor duplicated.
Running this script will end in notifying the user about the presence of such cell type.
3.3. BAM Enrichment files
Pre-formatting the
Hi-Cdata is a requirement to obtain the 5hmC (or any other) enrichment signal per bin, since the scripts below makes use of thehic_chr1.txt, ...,hic_chr19.txtcoordinates to generate the regions of interests (ROI) to extract the enrichment signal from them. 1. To collect the cell-specific signal for a set of genomic interactions, we first need to delimit the ROIs. - This step makes use of the scriptGenerate_ROIs.shthat will take theStartandEndof coverage per chromosome for a given cell type. ``` celltype_name=B00
# Run our auxiliary function as
./utils/GenerateROIs.sh $celltypename
- This will generate the appropriate `Rdata` file within the appropriate `celltype_name` folder under `src/data` to be used by other helper functions
- **IMPORTANT:** Make sure you consistently use the same `celltype_name` across the `Hi-C` and `rnaseq.csv` processing steps.
2. Now we can pull the enrichment signal out of these cell-specific ROIs.
- We will make use of the Rscript `Collect_Count_Signal_CellSpecific.R` that takes 4 positional arguments:
- `BAM` file path
- Enrichment condition name (as CMS or INPUT in our case, yours may vary if not using 5hmC; e.g. `T53ChIP`)
- `celltype_name`. **IMPORTANT:** Has to match a `src/data/` subfolder.
- `OUTDIR`: **this must be `src/data/celltype_name`**.
- I did not find a reliable way within an `Rscript` to find its own script location (even asking [`chatGPT`](https://chat.openai.com/chat) ;-) ) to automate this.
bam=./example/rawdata/cms/B00CMSIP.bam
condition=CMS
celltypename=B00
outdir=./src/data/${celltype_name}
# Run our auxiliary function as
./utils/Collect_Count_Signal_CellSpecific.R $bam $condition $celltype_name $outdir
```
- This will populate the
celltype_namedata folder with the required signal for the mark "CSM" - Repeat for all of the different marks you intent to analyze/integrate (e.g. with the INPUT enrichment we provided).
Example input from ./src/data/Naive_CD4T/chr1_10000bp_CMSIP.count, our Naive_CD4T's 5hmC enrichment file
chr1 3000000 3010000 250
chr1 3010000 3020000 225
chr1 3020000 3030000 222
...
chr1 195340000 195350000 236
chr1 195350000 195360000 44
chr1 195360000 195370000 202
4. Running the Two Main Codes
First run process_inputs_EGA.py on the command line from the src directory. After completion you are ready to train your network using run_models_EGA.py. For instance:
```
cd ./src
python processinputsEGA.py -c B00 -rf 0
wait (est. 30-40 mins)
python runmodelsEGA.py -c B00 -rf 0 ```
This will run the model for the B00 cell type using the classification task. The inputs to these flags can be changed so that the model can run for different cell lines (-c) as well as for either classification (-rf 0) or regression (-rf 1). Check inside each script for additional flag options.
5. Running GNNExplainer (Visualizing Results)
After the models have been trained, and the scores reported, you can run our expanded implementation of GNNExplainer. You can select an specific node (-n) you want explained or can provide a gene name (-gn) of interest, excluding genes located in chrX, chrY and chrM.
python ./src/GNNExplainNode.py \
--UsePredExp \
-od ./src/data/Nodes_Explained/Top10 \
-ep 1500 \
-rf 0 \
-hm 2 \
-cr 10000 \
-df data \
--DrawWeight \
--DrawCoords\
-mt ./src/data/Naive_CD4T_CMS_10000bp/saved_runs/model_2021-11-29-at-21-23-15.pt \
-mo Naive_CD4T \
-c Naive_CD4T

Citing our work
TBU -- current state is journal pre-release.
Owner
- Login: ay-lab
- Kind: user
- Repositories: 6
- Profile: https://github.com/ay-lab
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Graph 5hmC Convolutional Network"
version: 1.0.0
date-released: 2024-05
authors:
- family-names: "González-Avalos"
given-names: "Edahí"
orcid: "https://orcid.org/0000-0002-6817-4854"
affiliation: "University of California San Diego; La Jolla Institute for Immunology"
email: "edahi@lji.org"
- family-names: "Ay"
given-names: "Ferhat"
orcid: "https://orcid.org/0000-0002-0708-6914"
affiliation: "La Jolla Institute for Immunology; University of California San Diego"
email: "ferhatay@lji.org"
references:
- type: miscellaneous
name: "Software Testing by Dante Bolzan"
details: "Dante tested the repository ensuring reproducibility using provided example data in Zenodo."