SGCP
SGCP: a spectral self-learning method for clustering genes in co-expression networks
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
README.md
SGCP Introduction
The Self-training Gene Clustering Pipeline (SGCP) is a framework for constructing and analyzing gene co-expression networks. Its primary objective is to group genes with similar expression patterns into clusters, often called modules. SGCP introduces several novel steps that enable the computation of highly enriched gene modules in an unsupervised manner. Unlike existing frameworks, it then adds a semi-supervised clustering step that uses Gene Ontology (GO) information to refine the modules, which improves their enrichment and biological relevance.
SGCP Publication
The SGCP paper is published in BMC Bioinformatics.
SGCP Installation
For detailed instructions and steps, please refer to the SGCP manual on its Bioconductor page. To install the latest version of SGCP from the GitHub repository, use the following commands:
```{r}
install.packages("devtools")
devtools::install_github("na396/SGCP")
```
SGCP license
GPL-3
SGCP encoding
UTF-8
SGCP Input
SGCP requires three main inputs: expData, geneID, and annotation_db.
* expData: A matrix or data frame of size m*n, where m is the number of genes and n is the number of samples. It can contain data from either DNA-microarray or RNA-seq experiments. Note that SGCP assumes that pre-processing steps, such as normalization and batch effect correction, have already been performed, as these are not handled by the pipeline.
* geneID: A vector of gene identifiers corresponding to the rows in expData.
* annotation_db: The name of a genome-wide annotation package for the organism of interest, used in the gene ontology (GO) enrichment step. The annotation_db package must be installed by the user before using SGCP.
Below are some commonly used annotation_db packages along with their corresponding gene identifiers for different organisms.
| organism                     | annotation_db  | gene identifier         |
|:-----------------------------|:--------------:|:------------------------|
| Homo sapiens (Hs)            | org.Hs.eg.db   | Entrez Gene identifiers |
| Drosophila melanogaster (Dm) | org.Dm.eg.db   | Entrez Gene identifiers |
| Rattus norvegicus (Rn)       | org.Rn.eg.db   | Entrez Gene identifiers |
| Mus musculus (Mm)            | org.Mm.eg.db   | Entrez Gene identifiers |
| Arabidopsis thaliana (At)    | org.At.tair.db | TAIR identifiers        |
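Because the annotation package must be installed before SGCP is run, it can help to check for it up front. The following is a minimal sketch using base R's `requireNamespace()`; the helper name `has_annotation_db` is hypothetical and is not part of SGCP.

```{r}
# Hypothetical helper (not part of SGCP): TRUE if the named package is installed.
has_annotation_db <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

annotation_db <- "org.Hs.eg.db"
if (!has_annotation_db(annotation_db)) {
  # Only report the problem; installation is left to the user.
  message("Package ", annotation_db, " is not installed; ",
          "install it with BiocManager::install(\"", annotation_db, "\").")
}
```

The check only reports a missing package; it does not install anything on its own.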
Gene expression datasets for your analysis can be obtained from the Gene Expression Omnibus, a public repository of high-throughput gene expression data.
SGCP Input Cleaning
In SGCP, the following assumptions are made about the input genes:
- Genes must have expression values available across all samples, with no missing values.
- Genes must exhibit non-zero variance in expression across all samples.
- Each gene must have exactly one unique identifier, specified by geneID.
- Genes must be annotated with Gene Ontology (GO) terms.
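The first three assumptions can be verified with base R before running the pipeline. The sketch below is illustrative only: `check_input` is a hypothetical helper, not an SGCP function, and the toy matrix is made up. The GO annotation requirement is not covered here because it depends on the annotation package.

```{r}
# Hypothetical helper (not part of SGCP): checks missing values,
# per-gene variance, and identifier uniqueness.
check_input <- function(expData, geneID) {
  expData <- as.matrix(expData)
  c(
    no_missing  = !anyNA(expData),
    nonzero_var = isTRUE(all(apply(expData, 1, var) > 0)),
    unique_ids  = !anyNA(geneID) &&
      anyDuplicated(geneID) == 0 &&
      length(geneID) == nrow(expData)
  )
}

# Toy data: three genes (rows) by three samples (columns); g2 is constant.
toy <- matrix(c(1, 2, 3,
                5, 5, 5,
                2, 4, 8),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3")))

print(check_input(toy, rownames(toy)))
```

In this toy case the `nonzero_var` check fails because `g2` has the same value in every sample, which is the kind of gene the cleaning steps are meant to drop.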
SGCP Input Example
Here, we give a brief example of the SGCP input. For this documentation, we use the gene expression dataset GSE181225. For more information, visit its Gene Expression Omnibus page.
Throughout this section, several Bioconductor packages will be required. Make sure to install and load them as needed to follow the example.
```{r}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("org.Hs.eg.db", "GEOquery", "AnnotationDbi"))
```
First, set the working directory.
```{r}
# Display the current working directory
print(getwd())

# If necessary, change the path below to the directory where the data files are stored.
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = "."
setwd(workingDir)
```
Next, we need to download the gene expression file. The R package GEOquery is used to obtain gene expression data from the Gene Expression Omnibus. For detailed information on how to use GEOquery, refer to the GEOquery guide.
To download the expression file for GSE181225, visit its Gene Expression Omnibus page. On the page, locate the file `GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz` in the Supplementary files section, which contains the normalized gene expression data. Download this supplementary file and save it to the directory specified by `baseDir`.
```{r}
library(GEOquery)
gse = getGEOSuppFiles("GSE181225", baseDir = getwd())
```
After downloading the file, you should find a new directory named `GSE181225`, which contains the gene expression file. To proceed, read the gene expression file into R. The file has the following structure:
* The `Symbol` column contains the gene symbols.
* The remaining four columns represent different samples.
```{r}
df = read.delim("GSE181225/GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz")
head(df)
```
Next, create the expData, geneID, and annotation_db.
```{r}
geneID = df[,1]
expData = df[, 2:ncol(df)]
rownames(expData) = geneID

library(org.Hs.eg.db)
```
To map gene symbols to Entrez identifiers using the __annotation_db__, you can use the `select` function from the `AnnotationDbi` package:
```{r}
library(AnnotationDbi)

genes = AnnotationDbi::select(org.Hs.eg.db, keys = rownames(expData),
                              columns = c("ENTREZID"), keytype = "SYMBOL")

# initial dimension
print(dim(genes))
head(genes)
```
Remove genes with missing `SYMBOL` or `ENTREZID`.
```{r}
genes = genes[!is.na(genes$SYMBOL), ]
genes = genes[!is.na(genes$ENTREZID), ]

# dimension after dropping missing values
print(dim(genes))
head(genes)
```
Remove genes with duplicated SYMBOL or ENTREZID.
```{r}
genes = genes[!duplicated(genes$SYMBOL), ]
genes = genes[!duplicated(genes$ENTREZID), ]

# dimension after dropping duplicated identifiers
print(dim(genes))
print(head(genes))
```
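To see what the two filtering steps do, here is a small sketch on a made-up mapping table. `toy_genes` and its values are invented for illustration; they stand in for the output of `AnnotationDbi::select()` and do not come from GSE181225.

```{r}
# Toy mapping table standing in for the output of AnnotationDbi::select().
toy_genes <- data.frame(
  SYMBOL   = c("A", "B", "B", "C", "D"),
  ENTREZID = c("1", "2", "3", NA,  "1"),
  stringsAsFactors = FALSE
)

# Drop rows with a missing identifier (removes C).
toy_genes <- toy_genes[!is.na(toy_genes$SYMBOL) & !is.na(toy_genes$ENTREZID), ]

# Keep the first row per SYMBOL (removes the second B),
# then the first row per ENTREZID (removes D, which shares ID "1" with A).
toy_genes <- toy_genes[!duplicated(toy_genes$SYMBOL), ]
toy_genes <- toy_genes[!duplicated(toy_genes$ENTREZID), ]

print(toy_genes)
```

Note that `duplicated()` keeps the first occurrence, so which mapping survives depends on row order.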
Keep only rows in expData that have corresponding gene identifiers present in genes.
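```{r}
expData = data.frame(expData, SYMBOL = rownames(expData))
expData = merge(expData, genes, by = "SYMBOL")

# merge() sorts rows by SYMBOL, so reorder genes to match the rows of expData
genes = genes[match(expData$ENTREZID, genes$ENTREZID), ]
```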
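By default, `merge()` sorts its result by the key column, so the merged rows may no longer follow the order of the mapping table. The toy sketch below (all names and values are made up) shows the reordering and one way to realign the mapping with `match()`, so that later row-wise filters apply to the same genes in both objects.

```{r}
# Toy expression table and mapping, both in the same (unsorted) order.
toy_exp <- data.frame(s1 = c(1, 2, 3), s2 = c(4, 5, 6),
                      SYMBOL = c("C", "A", "B"),
                      stringsAsFactors = FALSE)
toy_map <- data.frame(SYMBOL = c("C", "A", "B"),
                      ENTREZID = c("30", "10", "20"),
                      stringsAsFactors = FALSE)

# merge() returns rows sorted by the key column (sort = TRUE by default).
merged <- merge(toy_exp, toy_map, by = "SYMBOL")
print(merged)

# Realign the mapping with the merged rows.
toy_map <- toy_map[match(merged$ENTREZID, toy_map$ENTREZID), ]
print(identical(toy_map$SYMBOL, merged$SYMBOL))
```

An alternative would be to pass `sort = FALSE` to `merge()`, but the row order is then left unspecified by the documentation, so an explicit `match()` is the safer choice.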
Produce expData.
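```{r}
rownames(expData) = expData$ENTREZID
# keep only the sample columns (drop the SYMBOL and ENTREZID columns)
expData = expData[, !(colnames(expData) %in% c("SYMBOL", "ENTREZID"))]
print(head(expData))
```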
Remove genes with zero variance from expData.
```{r}
# Dropping zero variance genes
vars = apply(expData, 1, var)
zeroInd = which(vars == 0)

if(length(zeroInd) != 0) {
  print(paste0("number of zero variance genes ", length(zeroInd)))
  expData = expData[-zeroInd, ]
  genes = genes[-zeroInd, ]
}

print(paste0("number of genes after dropping ", dim(genes)[1]))
```
Remove genes with no gene ontology mapping.
```{r}
# Remove genes with no GO mapping
xx = as.list(org.Hs.egGO[genes$ENTREZID])
haveGO = sapply(xx, function(x) {
  if (length(x) == 1 && is.na(x)) FALSE else TRUE
})
numNoGO = sum(!haveGO)

if(numNoGO != 0){
  print(paste0("number of genes with no GO mapping ", numNoGO))
  expData = expData[haveGO, ]
  genes = genes[haveGO, ]
}

print(paste0("number of genes after dropping ", dim(genes)[1]))
```
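The GO filter relies on the convention that a gene without any GO annotation is represented by a single `NA` entry in the list. The sketch below reproduces that test on a hand-made list, so it runs without any annotation package; `toy_go` and `toy_exp` are made-up stand-ins, and the real list elements returned by the annotation package are richer than the plain character vectors used here.

```{r}
# Toy stand-in for the per-gene GO list: gene "2" has no annotation.
toy_go <- list(
  "1" = c("GO:0008150", "GO:0003674"),
  "2" = NA,
  "3" = "GO:0005575"
)

# A gene has GO terms unless its entry is a single NA.
have_go <- vapply(toy_go,
                  function(x) !(length(x) == 1 && is.na(x)),
                  logical(1))
print(have_go)

# Apply the same logical index to a toy expression matrix.
toy_exp <- matrix(1:6, nrow = 3, dimnames = list(names(toy_go), c("s1", "s2")))
toy_exp <- toy_exp[have_go, , drop = FALSE]
print(rownames(toy_exp))
```

`vapply()` is used here instead of `sapply()` only to guarantee a logical vector; the filtering logic is the same.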
Produce the final __expData__, __geneID__, and __annotation_db__. The input is now ready for `SGCP`. Refer to the
[SGCP Bioconductor page](https://bioconductor.org/packages/release/bioc/html/SGCP.html) to see how to use this input in `SGCP`.
```{r}
expData = expData
print(head(expData))

geneID = genes$ENTREZID
print(head(geneID))

annotation_db = "org.Hs.eg.db"
```
Owner
- Name: Niloofar AghaieAbiane
- Login: na396
- Kind: user
- Location: New York, New York, USA
- Company: JP Morgan & Chase Co
- Website: https://web.njit.edu/~na396
- Repositories: 9
- Profile: https://github.com/na396
Computer Scientist | Machine Learning | Deep Learning | Statistical Modeling | Algorithm Design
Committers
Last synced: about 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| NILOOFAR AGHAIEABIANE | n****e@g****m | 113 |
| NILOOFAR AGHAIEABIANE | n****6@n****u | 48 |
| na396 | n****e@g****m | 19 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- na396 (1)
Packages
- Total packages: 1
- Total downloads:
  - bioconductor: 5,873 total
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 7
- Total maintainers: 1
bioconductor.org: SGCP
SGCP: A semi-supervised pipeline for gene clustering using self-training approach in gene co-expression networks
- Homepage: https://github.com/na396/SGCP
- Documentation: https://bioconductor.org/packages/release/bioc/vignettes/SGCP/inst/doc/SGCP.pdf
- License: GPL-3
- Latest release: 1.8.0 (published 10 months ago)
Rankings
Maintainers (1)
Dependencies
- R >= 4.2.0 depends
- DescTools * imports
- GO.db * imports
- GOstats * imports
- RColorBrewer * imports
- Rgraphviz * imports
- SummarizedExperiment * imports
- annotate * imports
- caret * imports
- dplyr * imports
- expm * imports
- genefilter * imports
- ggplot2 * imports
- ggridges * imports
- grDevices * imports
- methods * imports
- openxlsx * imports
- org.Hs.eg.db * imports
- plyr * imports
- reshape2 * imports
- stats * imports
- xtable * imports
- knitr * suggests