SGCP

SGCP: a spectral self-learning method for clustering genes in co-expression networks

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
✓
DOI references
Found 2 DOI reference(s) in README
✓
Academic publication links
Links to: ncbi.nlm.nih.gov, springer.com
✓
Committers with academic emails
1 of 3 committers (33.3%) from academic institutions
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (14.0%) to scientific vocabulary

Keywords

bioinformatics clustering genecoexpressionnetwork graphs networkclustering networks self-training semi-supervised-learning unsupervised-learning

Last synced: 6 months ago · JSON representation

Repository

SGCP: a spectral self-learning method for clustering genes in co-expression networks

Basic Info

Host: GitHub
Owner: na396
License: gpl-2.0
Language: R
Default Branch: main
Homepage:
Size: 5.21 MB

Statistics

Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

bioinformatics clustering genecoexpressionnetwork graphs networkclustering networks self-training semi-supervised-learning unsupervised-learning

Created over 3 years ago · Last pushed over 1 year ago

Metadata Files

Readme

SGCP: a spectral self-learning method for clustering genes in co-expression networks

SGCP Introduction

The Self-training Gene Clustering Pipeline (SGCP) is an innovative framework for constructing and analyzing gene co-expression networks. Its primary objective is to group genes with similar expression patterns into cohesive clusters, often referred to as modules. SGCP introduces several novel steps that enable the computation of highly enriched gene modules in an unsupervised manner. What sets SGCP apart from existing frameworks is its integration of a semi-supervised clustering approach, which leverages Gene Ontology (GO) information. This unique step significantly enhances the quality of the resulting modules, producing highly enriched and biologically relevant clusters.

SGCP Publication

The SGCP paper is published in BMC Bioinformatics.

SGCP Installation

For detailed instructions, please refer to the SGCP manual on the Bioconductor page. To install the latest version of SGCP from the GitHub repository, run the following commands:

```{r}
install.packages("devtools")
devtools::install_github("na396/SGCP")
```

SGCP license

GPL-3

SGCP encoding

UTF-8

SGCP Input

SGCP requires three main inputs: expData, geneID, and annotation_db.

* expData: A matrix or data frame of size m*n, where m is the number of genes and n is the number of samples. It can contain data from either DNA-microarray or RNA-seq experiments. SGCP assumes that pre-processing steps, such as normalization and batch effect correction, have already been performed; the pipeline does not handle them.
* geneID: A vector of gene identifiers corresponding to the rows of expData.
* annotation_db: The name of a genome-wide annotation package for the organism of interest, used in the gene ontology (GO) enrichment step. The annotation_db package must be installed by the user before using SGCP.
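As a rough illustration of the expected shapes, the sketch below builds a toy input. The expression values are random, and the identifiers and sample names are placeholders chosen for illustration; none of it comes from an SGCP dataset.

```r
# Toy input: 4 genes (rows) by 3 samples (columns); all values are made up.
set.seed(1)
expData = matrix(
  rnorm(12, mean = 8, sd = 2),
  nrow = 4, ncol = 3,
  dimnames = list(NULL, c("sample1", "sample2", "sample3"))
)

# One identifier per row of expData (placeholder identifiers, as characters).
geneID = c("1", "2", "9", "10")

# Name of the annotation package, passed as a character string.
annotation_db = "org.Hs.eg.db"

# The inputs are consistent when there is exactly one identifier per row.
stopifnot(length(geneID) == nrow(expData))
print(dim(expData))
```

Note that annotation_db is only the package name here; the package itself still has to be installed separately.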

Below are some commonly used annotation_db packages along with their corresponding gene identifiers for different organisms.

| organism                     | annotation_db  | gene identifier         |
|:-----------------------------|:--------------:|:------------------------|
| Homo sapiens (Hs)            | org.Hs.eg.db   | Entrez Gene identifiers |
| Drosophila melanogaster (Dm) | org.Dm.eg.db   | Entrez Gene identifiers |
| Rattus norvegicus (Rn)       | org.Rn.eg.db   | Entrez Gene identifiers |
| Mus musculus (Mm)            | org.Mm.eg.db   | Entrez Gene identifiers |
| Arabidopsis thaliana (At)    | org.At.tair.db | TAIR identifiers        |

Gene expression datasets for your analysis can be obtained from the Gene Expression Omnibus, a public repository of high-throughput gene expression data.

SGCP Input Cleaning

In SGCP, the following assumptions are made about the input genes:

* Genes must have expression values available across all samples, with no missing values.
* Genes must exhibit non-zero variance in expression across all samples.
* Each gene must have exactly one unique identifier, specified by geneID.
* Genes must be annotated with Gene Ontology (GO) terms.
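The first three assumptions can be checked with base R before running the pipeline. The sketch below is a minimal example on toy data; `checkInput` is a hypothetical helper written for this illustration and is not part of SGCP. The GO annotation check is left out because it needs an annotation package.

```r
# Hypothetical helper (not part of SGCP): flag genes that violate the
# first three input assumptions.
checkInput = function(expData, geneID) {
  expData = as.matrix(expData)
  stopifnot(length(geneID) == nrow(expData))
  # Any missing expression value in the row
  hasMissing = rowSums(is.na(expData)) > 0
  # var() is NA for rows with missing values; those are reported above instead
  rowVar = apply(expData, 1, var)
  zeroVariance = !is.na(rowVar) & rowVar == 0
  # Missing identifiers, or identifiers that occur more than once
  badID = is.na(geneID) | duplicated(geneID) | duplicated(geneID, fromLast = TRUE)
  data.frame(geneID = geneID, hasMissing = hasMissing,
             zeroVariance = zeroVariance, badID = badID)
}

# Toy data: gene "b" has a missing value, gene "c" is constant,
# and identifier "d" appears twice.
toy = rbind(c(1, 2, 3), c(4, NA, 6), c(5, 5, 5), c(7, 8, 9), c(2, 4, 8))
ids = c("a", "b", "c", "d", "d")
report = checkInput(toy, ids)
print(report)
```

Rows with any flag set to TRUE would need to be removed or repaired before the data is passed to SGCP.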

SGCP Input Example

Here, we give a brief example of the SGCP input. For this documentation, we use the gene expression dataset GSE181225. For more information, visit its Gene Expression Omnibus page.

Throughout this section, several Bioconductor packages will be required. Make sure to install and load them as needed to follow the example.

```{r}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("org.Hs.eg.db", "GEOquery", "AnnotationDbi"))
```

First, set the working directory.

```{r}
# Display the current working directory
print(getwd())

# If necessary, change the path below to the directory where the data files are stored.
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = "."
setwd(workingDir)
```

Next, download the gene expression file. The R package GEOquery is used to obtain gene expression data from the Gene Expression Omnibus. For detailed information on how to use GEOquery, refer to the GEOquery guide.

To download the expression file for GSE181225, visit its Gene Expression Omnibus page. On the page, locate the file `GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz` in the Supplementary files section, which contains the normalized gene expression data. Download this supplementary file and save it to the directory specified by `baseDir`.

```{r}
library(GEOquery)

gse = getGEOSuppFiles("GSE181225", baseDir = getwd())
```

After downloading the file, you should find a new directory named `GSE181225`, which contains the gene expression file. To proceed, read the gene expression file into R. The file has the following structure:

* The `Symbol` column contains the gene symbols.
* The remaining four columns represent different samples.

```{r}
df = read.delim("GSE181225/GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz")
head(df)
```

Next, create the expData, geneID, and annotation_db.

```{r}
geneID = df[, 1]

expData = df[, 2:ncol(df)]
rownames(expData) = geneID

library(org.Hs.eg.db)
```

To map gene symbols to Entrez identifiers using the __annotation_db__, use the `select` function from the `AnnotationDbi` package:

```{r}
library(AnnotationDbi)

genes = AnnotationDbi::select(org.Hs.eg.db, keys = rownames(expData), columns = c("ENTREZID"), keytype = "SYMBOL")

# initial dimension
print(dim(genes))
head(genes)
```

Remove genes with missing `SYMBOL` or `ENTREZID`.

```{r}
genes = genes[!is.na(genes$SYMBOL), ]
genes = genes[!is.na(genes$ENTREZID), ]

# dimension after dropping missing values
print(dim(genes))
head(genes)
```

Remove genes with duplicated SYMBOL or ENTREZID.

```{r}
genes = genes[!duplicated(genes$SYMBOL), ]
genes = genes[!duplicated(genes$ENTREZID), ]

# dimension after dropping duplicates
print(dim(genes))
print(head(genes))
```
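To see what the two filtering steps do, the sketch below applies them to a small hand-written table that stands in for the output of `AnnotationDbi::select()`. The symbols and identifiers are invented for illustration; the table mimics an unmapped key (NA), one symbol mapped to two identifiers, and two symbols sharing one identifier.

```r
# Hand-written stand-in for the output of AnnotationDbi::select();
# the symbols and identifiers are invented for illustration.
genes = data.frame(
  SYMBOL   = c("G1", "G2", "G2", "G3", "G4"),
  ENTREZID = c("101", "102", "103", NA, "101"),
  stringsAsFactors = FALSE
)

# Drop unmapped rows first.
genes = genes[!is.na(genes$SYMBOL) & !is.na(genes$ENTREZID), ]

# Then keep only the first row for each SYMBOL and for each ENTREZID.
genes = genes[!duplicated(genes$SYMBOL), ]
genes = genes[!duplicated(genes$ENTREZID), ]

print(genes)
```

Because `duplicated()` marks every occurrence after the first, the mapping that survives depends on row order. In this toy table, G2 keeps identifier 102 and G4 is dropped because G1 already holds identifier 101.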

Keep only rows in expData that have corresponding gene identifiers present in genes.

```{r}
expData = data.frame(expData, SYMBOL = rownames(expData))
expData = merge(expData, genes, by = "SYMBOL")
```
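One detail of `merge()` is worth keeping in mind: by default (`sort = TRUE`) it returns rows sorted on the key column, so the merged table may not be in the same order as the mapping table. The toy sketch below, with invented values, shows the reordering and one way to realign a companion table with `match()`.

```r
# Toy expression table and mapping table; all values are invented.
expr = data.frame(s1 = c(5, 1, 3), s2 = c(6, 2, 4),
                  SYMBOL = c("G3", "G1", "G2"))
map = data.frame(SYMBOL = c("G3", "G1", "G2"),
                 ENTREZID = c("303", "101", "202"))

# merge() sorts the result on the key column by default (sort = TRUE),
# so the merged rows come out ordered by SYMBOL rather than in input order.
merged = merge(expr, map, by = "SYMBOL")
print(merged$SYMBOL)

# Reorder the mapping table so that row i describes row i of `merged`.
map = map[match(merged$ENTREZID, map$ENTREZID), ]
stopifnot(all(map$SYMBOL == merged$SYMBOL))
```

Realigning matters whenever a later step indexes both tables by row position.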

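Produce expData.

```{r}
rownames(expData) = expData$ENTREZID

# merge() sorts rows by SYMBOL, so realign genes with the rows of expData
genes = genes[match(expData$ENTREZID, genes$ENTREZID), ]

# keep only the sample columns
expData = expData[, !(colnames(expData) %in% c("SYMBOL", "ENTREZID"))]
print(head(expData))
```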

Remove genes with zero variance from expData.

```{r}
# Dropping zero variance genes
vars = apply(expData, 1, var)
zeroInd = which(vars == 0)

if (length(zeroInd) != 0) {
    print(paste0("number of zero variance genes ", length(zeroInd)))
    expData = expData[-zeroInd, ]
    genes = genes[-zeroInd, ]
}

print(paste0("number of genes after dropping ", dim(genes)[1]))
```

Remove genes with no gene ontology mapping.

```{r}
# Remove genes with no GO mapping
xx = as.list(org.Hs.egGO[genes$ENTREZID])
haveGO = sapply(xx, function(x) {
    if (length(x) == 1 && is.na(x)) FALSE else TRUE
})
numNoGO = sum(!haveGO)
if (numNoGO != 0) {
    print(paste0("number of genes with no GO mapping ", numNoGO))
    expData = expData[haveGO, ]
    genes = genes[haveGO, ]
}
print(paste0("number of genes after dropping ", dim(genes)[1]))
```

Produce the final __expData__, __geneID__, and __annotation_db__. Now, the input is ready for `SGCP`. Refer to the [SGCP Bioconductor page](https://bioconductor.org/packages/release/bioc/html/SGCP.html) to see how to use this input in `SGCP`.

```{r}
expData = expData
print(head(expData))

geneID = genes$ENTREZID
print(head(geneID))

annotation_db = "org.Hs.eg.db"
```
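Before handing the three objects to SGCP, it can help to confirm that they are mutually consistent. The sketch below is one possible check, run here on toy data with invented values; `validateSGCPInput` is a hypothetical helper written for this illustration and is not a function provided by SGCP.

```r
# Hypothetical helper (not part of SGCP): stop early if the three inputs
# are not mutually consistent.
validateSGCPInput = function(expData, geneID, annotation_db) {
  stopifnot(
    is.character(annotation_db), length(annotation_db) == 1,
    length(geneID) == nrow(expData),
    !anyNA(geneID), !anyDuplicated(geneID),
    !anyNA(as.matrix(expData))
  )
  # Only report a missing annotation package, so the check still runs
  # on machines where it has not been installed yet.
  if (!requireNamespace(annotation_db, quietly = TRUE)) {
    message("annotation package not installed: ", annotation_db)
  }
  invisible(TRUE)
}

# Toy input with invented values.
toyExp = data.frame(s1 = c(1.5, 2.0), s2 = c(0.5, 3.0),
                    row.names = c("101", "202"))
ok = validateSGCPInput(toyExp, geneID = c("101", "202"),
                       annotation_db = "org.Hs.eg.db")
```

With the real objects, the call would be `validateSGCPInput(expData, geneID, annotation_db)`. Comparing `rownames(expData)` with `geneID` is a further check that the rows and identifiers are in the same order.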

Owner

Name: Niloofar AghaieAbiane
Login: na396
Kind: user
Location: New York, New York, USA
Company: JP Morgan & Chase Co

Website: https://web.njit.edu/~na396
Repositories: 9
Profile: https://github.com/na396

Computer Scientist | Machine Learning | Deep Learning | Statistical Modeling | Algorithm Design

Committers

Last synced: about 2 years ago

All Time

Total Commits: 180
Total Committers: 3
Avg Commits per committer: 60.0
Development Distribution Score (DDS): 0.372

Past Year

Commits: 4
Committers: 1
Avg Commits per committer: 4.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
NILOOFAR AGHAIEABIANE	n**e@g**m	113
NILOOFAR AGHAIEABIANE	n**6@n**u	48
na396	n**e@g**m	19

Committer Domains (Top 20 + Academic)

gmail.com: 1
njit.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 1 minute
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Pull Request Authors

na396 (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- bioconductor 5,873 total

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 7
Total maintainers: 1

bioconductor.org: SGCP

SGCP: A semi-supervised pipeline for gene clustering using self-training approach in gene co-expression networks

Homepage: https://github.com/na396/SGCP
Documentation: https://bioconductor.org/packages/release/bioc/vignettes/SGCP/inst/doc/SGCP.pdf
License: GPL-3
Latest release: 1.8.0
published 10 months ago

Versions: 7
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 5,873 Total

Rankings

Dependent repos count: 0.0%

Dependent packages count: 0.0%

Average: 32.7%

Downloads: 98.2%

Maintainers (1)

niloofar.abiane@gmail.com

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

R >= 4.2.0 depends
DescTools * imports
GO.db * imports
GOstats * imports
RColorBrewer * imports
Rgraphviz * imports
SummarizedExperiment * imports
annotate * imports
caret * imports
dplyr * imports
expm * imports
genefilter * imports
ggplot2 * imports
ggridges * imports
grDevices * imports
methods * imports
openxlsx * imports
org.Hs.eg.db * imports
plyr * imports
reshape2 * imports
stats * imports
xtable * imports
knitr * suggests
