BgeeDB

Source code of the R package BgeeDB to use data from the Bgee database

https://github.com/bgeedb/bgeedb_r

Keywords

bioinformatics enrichment-analysis rna-seq scrna-seq single-cell

Keywords from Contributors

bioconductor-package genomics gene transcriptomics bioconductor u24ca289073 core-package core-services tcga mutations

Last synced: 6 months ago

Repository

Source code of the R package BgeeDB to use data from the Bgee database

Basic Info

Host: GitHub
Owner: BgeeDB
License: gpl-3.0
Language: R
Default Branch: master
Homepage: https://bioconductor.org/packages/BgeeDB/
Size: 576 KB

Statistics

Stars: 16
Watchers: 6
Forks: 6
Open Issues: 19
Releases: 2

Topics

bioinformatics enrichment-analysis rna-seq scrna-seq single-cell

Created almost 10 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Code of conduct

BgeeDB, an R package for retrieval of curated expression datasets and for gene list enrichment tests

Andrea Komljenović, Julien Roux, Marc Robinson-Rechavi, Frédéric Bastian

BgeeDB is a collection of functions to import data from the Bgee database (http://bgee.org/) directly into R, and to facilitate downstream analyses, such as gene set enrichment test based on expression of genes in anatomical structures. Bgee provides annotated and processed expression data and expression calls from curated wild-type healthy samples, from human and many other animal species.

The package retrieves the annotation of RNA-seq or Affymetrix experiments integrated into the Bgee database, and downloads into R the quantitative data and expression calls produced by the Bgee pipeline. The package also runs GO-like enrichment analyses based on anatomical terms, where genes are mapped to anatomical terms by expression patterns, building on the topGO package. This is the same analysis as the TopAnat web service available at (http://bgee.org/?page=top_anat#/), but with more flexibility in the choice of parameters and developmental stages.

In summary, the BgeeDB package allows you to:

1. List the annotation of RNA-seq and microarray data available in the Bgee database
2. Download the processed gene expression data available in the Bgee database
3. Download the gene expression calls and use them to perform TopAnat analyses

The pipeline used to generate Bgee expression data is documented and publicly available at (https://github.com/BgeeDB/bgee_pipeline)

If you find a bug or have any issue using BgeeDB, please write a bug report in our GitHub issue tracker, available at (https://github.com/BgeeDB/BgeeDB_R/issues)

Installation

In R:

```{r}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("BgeeDB")
```

How to use BgeeDB package

Load the package

```{r, message = FALSE, warning = FALSE}
library(BgeeDB)
```

Running example: downloading and formatting processed RNA-seq data

List available species in Bgee

The listBgeeSpecies() function retrieves the species available in the Bgee database, and the data types available for each species.

```{r}
listBgeeSpecies()
```

It is possible to list all species from a specific release of Bgee with the release argument (see the listBgeeRelease() function), and to order the species according to a specific column with the ordering argument. For example:

```{r}
listBgeeSpecies(release = "13.2", ordering = 2)
```

Create a new Bgee object

In the following example we will focus on mouse ("Mus_musculus") RNA-seq. Species can be specified using their name or their NCBI taxonomic ID. To download RNA-seq data, set the dataType argument to "rna_seq". To download Affymetrix microarray data, set this argument to "affymetrix".

```{r}
bgee <- Bgee$new(species = "Mus_musculus", dataType = "rna_seq")
```

Note 1: It is possible to work with data from a specific release of Bgee by specifying the release argument, see listBgeeRelease() function.

Note 2: The functions of the package will store the downloaded files in a versioned folder created by default in the working directory. These cache files allow faster re-access to the data. The directory where data are stored can be changed with the pathToData argument.

Retrieve the annotation of mouse RNA-seq datasets

The getAnnotation() function will output the list of RNA-seq experiments and libraries available in Bgee for mouse.

```{r}
annotationbgeemouse <- getAnnotation(bgee)
# list the first experiments and libraries
lapply(annotationbgeemouse, head)
```

Download the processed mouse RNA-seq data

The getData() function will download processed RNA-seq data from all mouse experiments in Bgee as a list.

```{r}
# download all RNA-seq experiments from mouse
databgeemouse <- getData(bgee)
# number of experiments downloaded
length(databgeemouse)
# check the downloaded data
lapply(databgeemouse, head)
# isolate the first experiment
databgeeexperiment1 <- databgeemouse[[1]]
```

Each experiment returned by the getData() function is a data frame. Each row is a gene and the expression levels are given as raw read counts, RPKMs (up to Bgee 13.2), TPMs (from Bgee 14.0), or FPKMs (from Bgee 14.0). A detection flag indicates whether the gene is significantly expressed above the background level of expression.

Note: If microarray data are downloaded, rows correspond to probesets, and log2 expression intensities are available instead of read counts/RPKMs/TPMs/FPKMs.

Alternatively, you can choose to download data from only one particular RNA-seq experiment from Bgee with the experimentId parameter:

```{r}
# download data for GSE30617
databgeemouse_gse30617 <- getData(bgee, experimentId = "GSE30617")
```

Format the RNA-seq data

It is sometimes easier to work with data organized as a matrix, where rows represent genes or probesets and columns represent different samples. The formatData() function reformats the data into an ExpressionSet object including:

* An expression data matrix, with genes or probesets as rows, and samples as columns (assayData slot). For RNA-seq data, the stats argument selects whether the matrix is filled with read counts, RPKMs (up to Bgee 13.2), FPKMs (from Bgee 14.0), or TPMs (from Bgee 14.0). For microarray data the matrix is filled with log2 expression intensities.
* A data frame listing the samples and their anatomical structure and developmental stage annotation (phenoData slot)
* For microarray data, the mapping from probesets to Ensembl genes (featureData slot)

The callType argument, if set to "present" or "present high quality", retains only actively expressed genes or probesets. Genes or probesets that are absent in a given sample are given NA values.

```{r}
# use only present calls and fill expression matrix with FPKM values
gene.expression.mouse.fpkm <- formatData(bgee, databgeemouse_gse30617, callType = "present", stats = "fpkm")
gene.expression.mouse.fpkm
```
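To make the reshaping step more concrete, here is a minimal base R sketch of the idea behind it: a long table of per-library values is turned into a genes-by-samples matrix, and values that are not called "present" become NA. The toy table and its column names (gene, library, fpkm, call) are invented for illustration and are not the actual Bgee column names, and this is not the implementation used by formatData().

```r
# Toy long-format table; column names are illustrative only
toy <- data.frame(
  gene    = c("g1", "g2", "g1", "g2"),
  library = c("libA", "libA", "libB", "libB"),
  fpkm    = c(10.5, 0.2, 8.1, 3.3),
  call    = c("present", "absent", "present", "present"),
  stringsAsFactors = FALSE
)

# Mask expression values that are not called "present"
toy$fpkm[toy$call != "present"] <- NA

# Reshape to a matrix with genes as rows and libraries as columns
expr <- tapply(toy$fpkm, list(toy$gene, toy$library), function(x) x[1])
expr
```

In the real package the result is an ExpressionSet rather than a bare matrix, so the sample annotation travels with the expression values.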

Running example: TopAnat gene expression enrichment analysis

For some documentation on the TopAnat analysis, please refer to our publications, or to the web-tool page (http://bgee.org/?page=top_anat#/).

Create a new Bgee object

Similarly to the quantitative data download example above, the first step of a topAnat analysis is to build an object from the Bgee class. For this example, we will focus on zebrafish:

```{r}
# Creating new Bgee class object
bgee <- Bgee$new(species = "Danio_rerio")
```

Note: Any data type of interest can be specified with the dataType argument, among rna_seq, affymetrix, est or in_situ, or even a combination of data types. If nothing is specified, as in the above example, all data types available for the targeted species are used. This is equivalent to specifying dataType=c("rna_seq","affymetrix","est","in_situ").

Download the data allowing to perform TopAnat analysis

The loadTopAnatData() function loads a mapping from genes to anatomical structures based on calls of expression in anatomical structures. It also loads the structure of the anatomical ontology and the names of anatomical structures.

```{r}
# Loading calls of expression
myTopAnatData <- loadTopAnatData(bgee)
# Look at the data
str(myTopAnatData)
```

The stringency on the quality of expression calls can be changed with the confidence argument. Finally, if you are interested in expression data coming from a particular developmental stage or a group of stages, specify a Uberon stage ID in the stage argument.

```{r, eval=FALSE}
# Loading silver and gold expression calls from affymetrix data made on embryonic samples only
# This is just given as an example, do not run this code because only few data will be returned by the TopAnat gene expression enrichment analysis
bgee <- Bgee$new(species = "Danio_rerio", dataType="affymetrix")
myTopAnatData <- loadTopAnatData(bgee, stage="UBERON:0000068", confidence="silver")
```

Note 1: As mentioned above, the downloaded data files are stored in a versioned folder that can be set with the pathToData argument when creating the Bgee class object (default is the working directory). If you query Bgee again with the exact same parameters, these cached files will be read instead of querying the web service again. If you plan to reuse the same data for multiple parallel topAnat analyses, it is thus important to make use of these cached files instead of redownloading them for each analysis. The cached files also make it possible to repeat analyses offline.

Note 2: In releases up to Bgee 13.2, allowed `confidence` values were `low_quality` or `high_quality`. From Bgee 14.0, `confidence` values are `gold` or `silver`.
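The caching behaviour described in Note 1 follows a common "read from disk if present, otherwise fetch and store" pattern. The sketch below illustrates that pattern in base R only. The helper cached_fetch(), the demo fetch function and the folder name are hypothetical and are not part of BgeeDB, whose actual file layout may differ.

```r
# Hypothetical helper: return the cached value if it exists, else fetch and store it
cached_fetch <- function(key, fetch, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))   # cache hit: no query needed
  }
  value <- fetch()          # cache miss: run the (slow) query once
  saveRDS(value, path)
  value
}

# Stand-in for a web-service query; counts how often it is really called
n_calls <- 0
fetch_demo <- function() {
  n_calls <<- n_calls + 1
  data.frame(gene = "g1", call = "present", stringsAsFactors = FALSE)
}

# Start from an empty demo folder so the example is reproducible
cache_dir <- file.path(tempdir(), "Danio_rerio_Bgee_demo")
unlink(cache_dir, recursive = TRUE)

a <- cached_fetch("topAnatData", fetch_demo, cache_dir)
b <- cached_fetch("topAnatData", fetch_demo, cache_dir)
n_calls
```

The second call reads the stored file, which is why pointing several analyses at the same folder avoids repeated downloads.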

Prepare a topAnatData object allowing to perform TopAnat analysis with topGO

First we need to prepare a list of genes in the foreground and in the background. The input format is the same as the gene list required to build a topGOdata object in the topGO package: a vector with background genes as names, and 0 or 1 values depending on whether a gene is in the foreground or not. In this example we will look at genes with an annotated phenotype related to "pectoral fin". We use the biomaRt package to retrieve this list of genes. We expect them to be enriched for expression in pectoral fins and related structures. The background list of genes is set to all genes with at least one phenotype annotation from ZFIN, which accounts for biases in which types of genes are more likely to receive a phenotype annotation.
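Before the real query, the input format itself can be shown with a toy example in base R. The gene identifiers below are invented; only the shape of the object (a named factor with levels 0 and 1) matters.

```r
# Invented identifiers, for illustration only
background <- c("geneA", "geneB", "geneC", "geneD", "geneE")
foreground <- c("geneB", "geneE")

# 1 if the background gene is in the foreground, 0 otherwise
geneList <- factor(as.integer(background %in% foreground))
names(geneList) <- background

geneList
table(geneList)
```

Every foreground gene must also be part of the background, since the vector is built by testing background genes for membership in the foreground.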

```{r}
# if (!requireNamespace("BiocManager", quietly=TRUE))
#     install.packages("BiocManager")
# BiocManager::install("biomaRt")
library(biomaRt)
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset="drerio_gene_ensembl", host="mar2016.archive.ensembl.org")

# get the mapping of Ensembl genes to phenotypes. It corresponds to the background genes
universe <- getBM(filters=c("phenotype_source"), values=c("ZFIN"), attributes=c("ensembl_gene_id","phenotype_description"), mart=ensembl)

# select phenotypes related to pectoral fin
phenotypes <- grep("pectoral fin", unique(universe$phenotype_description), value=TRUE)

# Foreground genes are those with an annotated phenotype related to "pectoral fin"
myGenes <- unique(universe$ensembl_gene_id[universe$phenotype_description %in% phenotypes])

# Prepare the gene list vector
geneList <- factor(as.integer(unique(universe$ensembl_gene_id) %in% myGenes))
names(geneList) <- unique(universe$ensembl_gene_id)
summary(geneList)

# Prepare the topGO object
myTopAnatObject <- topAnat(myTopAnatData, geneList)
```

Warning: This can be long, especially if the gene list is large, since the Uberon anatomical ontology is large and expression calls will be propagated through the whole ontology (e.g., expression in the forebrain will also be counted as expression in parent structures such as the brain, nervous system, etc). Consider running a script in batch mode if you have multiple analyses to do.
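The propagation mentioned in this warning can be pictured with a small base R sketch. It assumes a toy ontology in which every structure has a single parent, which is a simplification: the real Uberon ontology is much larger and a structure can have several parents. The objects parent_of, ancestors() and calls are hypothetical and only serve the illustration.

```r
# Toy child -> parent relations (single parent per term, unlike real Uberon)
parent_of <- c(forebrain = "brain", brain = "nervous system")

# Walk up the toy ontology and collect all ancestors of a term
ancestors <- function(term) {
  out <- character(0)
  while (term %in% names(parent_of)) {
    term <- parent_of[[term]]
    out <- c(out, term)
  }
  out
}

# Direct expression calls
calls <- data.frame(
  gene = c("g1", "g2"),
  term = c("forebrain", "brain"),
  stringsAsFactors = FALSE
)

# Each call is repeated for the term itself and for all of its ancestors
propagated <- do.call(rbind, lapply(seq_len(nrow(calls)), function(i) {
  data.frame(
    gene = calls$gene[i],
    term = c(calls$term[i], ancestors(calls$term[i])),
    stringsAsFactors = FALSE
  )
}))
propagated
```

Even in this toy case two direct calls become five gene-term pairs, which gives a sense of why the step is slow for large gene lists.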

Launch the enrichment test

This step directly uses the tests implemented in the topGO package; see the vignette of the topGO package for more details. For example:

```{r}
results <- runTest(myTopAnatObject, algorithm = 'weight', statistic = 'fisher')
```

Warning: This can be long because of the size of the ontology. Consider running scripts in batch mode if you have multiple analyses to do.
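For readers unfamiliar with the underlying statistic, the sketch below shows a plain one-sided Fisher exact test for a single anatomical term, using base R. All counts are made up for illustration. Note that this corresponds to a classic per-term test and not to the 'weight' algorithm used above, which additionally takes the ontology structure into account.

```r
# Made-up counts for one anatomical term
n_background <- 1000  # genes in the background
n_foreground <- 50    # genes in the foreground
n_in_term    <- 100   # background genes expressed in the term
n_fg_in_term <- 20    # foreground genes expressed in the term

# 2x2 contingency table: term membership (rows) by foreground membership (columns)
tab <- matrix(
  c(n_fg_in_term,
    n_foreground - n_fg_in_term,
    n_in_term - n_fg_in_term,
    n_background - n_foreground - (n_in_term - n_fg_in_term)),
  nrow = 2,
  dimnames = list(c("in term", "not in term"), c("foreground", "other"))
)

# Over-representation test
p_value <- fisher.test(tab, alternative = "greater")$p.value

# Fold enrichment: observed fraction in the foreground over fraction in the background
fold <- (n_fg_in_term / n_foreground) / (n_in_term / n_background)

tab
p_value
fold
```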

Format the table of results after an enrichment test for anatomical terms

The makeTable function filters and formats the test results, and calculates FDR values.

```{r}
# Display results significant at a 1% FDR threshold
tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, cutoff = 0.01)
head(tableOver)
```

At the time of building this README (June 2018), there were 27 significant anatomical structures. The first term is “pectoral fin”, and the second “paired limb/fin bud”. Other terms in the list, especially those with high enrichment folds, are clearly related to pectoral fins or substructures of fins. This analysis shows that genes with phenotypic effects on pectoral fins are specifically expressed in or next to these structures.

By default results are sorted by p-value, but this can be changed with the ordering parameter by specifying which column should be used to order the results (preceded by a "-" sign to indicate that ordering should be made in decreasing order). For example, it is often convenient to sort significant structures by decreasing enrichment fold, using ordering = -6. The full table of results can be obtained using cutoff = 1.
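The filtering and ordering logic described above can be mimicked on a toy results table in base R. The table, its column names and the 5% cutoff are invented, and the Benjamini-Hochberg method is an assumption made for this sketch; it is not a statement about how makeTable() computes its FDR column.

```r
# Invented per-term results
res <- data.frame(
  term           = c("t1", "t2", "t3", "t4"),
  foldEnrichment = c(5.2, 1.1, 12.0, 2.4),
  pValue         = c(0.001, 0.40, 0.004, 0.03),
  stringsAsFactors = FALSE
)

# Assumed correction method for this sketch: Benjamini-Hochberg
res$FDR <- p.adjust(res$pValue, method = "BH")

# Keep terms below the cutoff, then sort by decreasing enrichment fold
sig <- res[res$FDR <= 0.05, ]
sig <- sig[order(-sig$foldEnrichment), ]
sig
```

Sorting by fold rather than by p-value tends to bring small, specific structures to the top of the table.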

Of note, it is possible to retrieve for a particular tissue the significant genes that were mapped to it.

```{r}
# In order to retrieve significant genes mapped to the term "paired limb/fin bud"
term <- "UBERON:0004357"
termStat(myTopAnatObject, term)

# 198 genes mapped to this term for Bgee 14.0 and Ensembl 84
genesInTerm(myTopAnatObject, term)

# 48 significant genes mapped to this term for Bgee 14.0 and Ensembl 84
annotated <- genesInTerm(myTopAnatObject, term)[["UBERON:0004357"]]
annotated[annotated %in% sigGenes(myTopAnatObject)]
```

Warning: it is debated whether FDR correction is appropriate on enrichment test results, since tests on different terms of the ontologies are not independent. A nice discussion can be found in the vignette of the topGO package.

Store expression data locally

Since version 2.14.0 (Bioconductor 3.11), BgeeDB stores downloaded expression data in a local RSQLite database. The advantages of this approach compared to the one used in previous BgeeDB versions are:

* a server with a lot of memory is no longer needed to access a subset of a huge dataset (e.g., the GTEx dataset)
* more granular filtering using arguments of the getData() function
* the same data are not downloaded twice
* fast access to data once integrated in the local database

This approach comes with some drawbacks:

* the local SQLite database uses more disk space than the previous compressed .rds approach
* the first access to a dataset takes more time (integration into the local SQLite database is time consuming)

It is possible to remove the .rds files generated by previous versions of BgeeDB, which are no longer used since version 2.14.0. The function below deletes all .rds files for the selected species, for all data types.

```{r, eval = FALSE}
bgee <- Bgee$new(species="Mus_musculus", release = "14.1")
# delete all old .rds files of species Mus musculus
deleteOldData(bgee)
```

As the new SQLite approach uses more disk space, it is now possible to delete all local data of one species from one release of Bgee.

```{r, eval = FALSE}
bgee <- Bgee$new(species="Mus_musculus", release = "14.1")
# delete local SQLite database of species Mus musculus from Bgee 14.1
deleteLocalData(bgee)
```

Owner

Name: BgeeDB
Login: BgeeDB
Kind: organization

Website: https://bgee.org/
Twitter: Bgeedb
Repositories: 10
Profile: https://github.com/BgeeDB

GitHub Events

Total

Watch event: 1
Push event: 9
Create event: 1

Last Year

Watch event: 1
Push event: 9
Create event: 1

Committers

Last synced: over 2 years ago

All Time

Total Commits: 422
Total Committers: 23
Avg Commits per committer: 18.348
Development Distribution Score (DDS): 0.749

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Julien Roux	j**x@u**h	106
Julien	j**t@g**m	84
wirawara	w****a	77
wirawara	a**c@s**s	33
Julien Wollbrett	j**t@b**h	30
Andrea Komljenovic	a**c@g**m	29
Nitesh Turaga	n**a@g**m	13
Herve Pages	h**s@f**g	7
Frédéric Bastian	b**e@s**s	7
Dan Tenenbaum	d**a@f**g	6
jwollbrett	j**t@u**h	4
julien.roux@unibas.ch	j**x@u**h	4
smoretti	s****i	3
fbastian	f****n	3
SFonsecaCosta	s**a@u**h	2
Hervé Pagès	h**s@f**g	2
Martin Morgan	m**n@f**g	2
vobencha	v**a@g**m	2
Rouxj	r**j@d**h	2
vobencha	v**n@r**g	2
fbastian	f**n@g**m	2
LiNk-NY	m**z@r**g	1
LiNk-NY	m**9@g**m	1

Committer Domains (Top 20 + Academic)

fhcrc.org: 3
unil.ch: 3
roswellpark.org: 2
sib.swiss: 2
df203-228.dfusb.unibas.ch: 1
fredhutch.org: 1
unibas.ch: 1
bio-51-170.unil.ch: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 29
Total pull requests: 1
Average time to close issues: 6 months
Average time to close pull requests: N/A
Total issue authors: 9
Total pull request authors: 1
Average comments per issue: 1.9
Average comments per pull request: 5.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 1
Average time to close issues: about 17 hours
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 1
Average comments per issue: 2.0
Average comments per pull request: 5.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

julien-roux (15)
Ni-Ar (3)
jwollbrett (3)
dhimmel (2)
HenrikCordes (2)
anishahaldar (1)
marcrr (1)
shubihu (1)
fbastian (1)

Pull Request Authors

garethgillard (2)

Top Labels

Issue Labels

enhancement (9)
low priority (9)
high priority (5)
bug (3)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- bioconductor 24,716 total

Total dependent packages: 2
Total dependent repositories: 0
Total versions: 6
Total maintainers: 4

bioconductor.org: BgeeDB

Annotation and gene expression data retrieval from Bgee database. TopAnat, an anatomical entities Enrichment Analysis tool for UBERON ontology

Homepage: https://github.com/BgeeDB/BgeeDB_R
Documentation: https://bioconductor.org/packages/release/bioc/vignettes/BgeeDB/inst/doc/BgeeDB.pdf
License: GPL-3 + file LICENSE
Latest release: 2.34.0
published 9 months ago

Versions: 6
Dependent Packages: 2
Dependent Repositories: 0
Downloads: 24,716 Total

Rankings

Dependent repos count: 0.0%

Dependent packages count: 0.0%

Stargazers count: 8.0%

Forks count: 10.0%

Average: 12.8%

Downloads: 46.0%

Maintainers (4)

julien.wollbrett@unil.ch
julien.roux@unibas.ch
andreakomljenovic@gmail.com
bgee@sib.swiss

Last synced: 6 months ago

BgeeDB

Science Score: 46.0%
