https://github.com/bioconductor-source/tuberculosis

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links: links to ncbi.nlm.nih.gov
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (14.8%) to scientific vocabulary

Repository

Basic Info

Host: GitHub
Owner: bioconductor-source
License: artistic-2.0
Language: R
Default Branch: devel
Size: 72.3 KB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed almost 2 years ago

README.Rmd

---
output: github_document
---



# tuberculosis 


[![code quality](https://img.shields.io/codefactor/grade/github/schifferl/tuberculosis)](https://www.codefactor.io/repository/github/schifferl/tuberculosis)


The `r BiocStyle::Biocpkg("tuberculosis")` R/Bioconductor package features tuberculosis gene expression data for machine learning. All human samples from [GEO](https://www.ncbi.nlm.nih.gov/geo/) that did not come from cell lines, were not taken postmortem, and did not feature recombination have been included. The package has more than 10,000 samples from both microarray and sequencing studies that have been processed from raw data through a hyper-standardized, reproducible pipeline.

## The Pipeline

To fully understand the provenance of data in the `r BiocStyle::Biocpkg("tuberculosis")` R/Bioconductor package, please see the [tuberculosis.pipeline](https://github.com/schifferl/tuberculosis.pipeline) GitHub repository. Most users can skip these details without consequence, but a brief summary of the data processing is appropriate here. Microarray data were processed from raw files (e.g. `CEL` files) and background corrected using the normal-exponential method with the saddle-point approximation to maximum likelihood, as implemented in the `r BiocStyle::Biocpkg("limma")` R/Bioconductor package. No normalization of expression values was done. Where platforms necessitated it, the RMA (robust multichip average) algorithm, without background correction or normalization, was used to generate an expression matrix. Sequencing data were processed from raw files (i.e. `fastq` files) using the [nf-core/rnaseq](https://nf-co.re/rnaseq/1.4.2) pipeline inside a Singularity container; the GRCh38 genome build was used for alignment. Gene names for both microarray and sequencing data are HGNC-approved GRCh38 gene names from the [genenames.org](https://www.genenames.org/) REST API.
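
As a rough illustration of the background-correction step summarized above, the sketch below applies normal-exponential correction with the saddle-point approximation to a small simulated intensity matrix using `backgroundCorrect.matrix` from limma. The simulated data, the seed, and all object names are assumptions made for illustration; this is not the code from the tuberculosis.pipeline repository.

```r
library(limma)

set.seed(1)

# Simulated raw intensities (hypothetical data, not from the package):
# exponential signal plus normally distributed background,
# 1000 probes by 3 arrays.
n_probes <- 1000
n_arrays <- 3
signal <- matrix(rexp(n_probes * n_arrays, rate = 1 / 500), nrow = n_probes)
background <- matrix(rnorm(n_probes * n_arrays, mean = 100, sd = 10), nrow = n_probes)
raw <- signal + background
colnames(raw) <- paste0("array", seq_len(n_arrays))

# Normal-exponential background correction, fitted with the saddle-point
# approximation to maximum likelihood. No normalization is applied afterwards.
corrected <- backgroundCorrect.matrix(
  raw,
  method = "normexp",
  normexp.method = "saddle",
  offset = 0,
  verbose = FALSE
)

dim(corrected)
```

The corrected matrix keeps the dimensions of the input, with one fitted background model per array (column).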

## Installation

To install `r BiocStyle::Biocpkg("tuberculosis")` from Bioconductor, use `r BiocStyle::CRANpkg("BiocManager")` as follows.

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("tuberculosis")
```

To install `r BiocStyle::Biocpkg("tuberculosis")` from GitHub, use `r BiocStyle::CRANpkg("BiocManager")` as follows.

```{r, eval = FALSE}
BiocManager::install("schifferl/tuberculosis", dependencies = TRUE, build_vignettes = TRUE)
```

Most users should simply install `r BiocStyle::Biocpkg("tuberculosis")` from Bioconductor.

## Load Package

To use the package without double colon syntax, it should be loaded as follows.

```{r, message = FALSE}
library(tuberculosis)
```

The package is lightweight, with few dependencies, and contains no data itself.

## Finding Data

To find data, users will use the `tuberculosis` function with a regular expression pattern to list available resources. The resources are organized by [GEO](https://www.ncbi.nlm.nih.gov/geo/) series accession numbers. If multiple platforms were used in a single study, the platform accession number follows the series accession number and is separated by a dash. The date before the series accession number denotes the date the resource was created.

```{r}
tuberculosis("GSE103147")
```

The function will print the names of matching resources as a message and return them invisibly as a character vector. To see all available resources use `"."` for the `pattern` argument.
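
Because the `pattern` argument is a regular expression, the matching can be previewed with base R alone. The sketch below uses made-up resource names; the exact layout of a name (creation date, a dot, the series accession, then an optional dash and platform accession) is an assumption based on the description above, and the code does not call the package or ExperimentHub.

```r
# Hypothetical resource names (assumed layout: date.series[-platform]).
resource_names <- c(
  "2021-09-15.GSE103147",
  "2021-09-15.GSE107991",
  "2021-09-15.GSE107992",
  "2021-09-15.GSE50834-GPL10558"
)

# "." matches any single character, so this pattern selects every name
# that contains "GSE10799" followed by one more character.
matched <- grep("GSE10799.", resource_names, value = TRUE)

# Pull the three components out of each name.
creation_date <- as.Date(substr(resource_names, 1, 10))
series <- regmatches(resource_names, regexpr("GSE[0-9]+", resource_names))
platform <- ifelse(
  grepl("-GPL[0-9]+$", resource_names),
  sub("^.*-(GPL[0-9]+)$", "\\1", resource_names),
  NA_character_
)

parts <- data.frame(creation_date, series, platform, stringsAsFactors = FALSE)
print(matched)
print(parts)
```

Only names from studies that used more than one platform carry a platform accession, so `platform` is `NA` for the others.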

## Getting Data

To get data, users will also use the `tuberculosis` function, but with an additional argument, `dryrun = FALSE`. This will either download resources from `r BiocStyle::Biocpkg("ExperimentHub")` or load them from the user's local cache. If a resource has multiple creation dates, the most recent is selected by default; include a creation date in the pattern to select a specific version instead.

```{r}
tuberculosis("GSE103147", dryrun = FALSE)
```

The function returns a `list` of `SummarizedExperiment` objects, each with a single assay, `exprs`, where the rows are features (genes) and the columns are observations (samples). If multiple resources are requested, multiple resources will be returned, each as a `list` element.

```{r}
tuberculosis("GSE10799.", dryrun = FALSE)
```
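
To show how such a return value might be handled, the sketch below builds two mock `SummarizedExperiment` objects and extracts the `exprs` assay from each. The gene names, sample names, list names, and the `make_mock_se` helper are hypothetical; a real call would return objects with far more rows and columns.

```r
library(SummarizedExperiment)

set.seed(1)

# Hypothetical helper: build a small mock object with a single "exprs" assay.
make_mock_se <- function(n_samples) {
  exprs <- matrix(
    rpois(3 * n_samples, lambda = 50),
    nrow = 3,
    dimnames = list(
      c("A1BG", "TP53", "ZZZ3"),
      paste0("SRR", seq_len(n_samples))
    )
  )
  SummarizedExperiment(assays = list(exprs = exprs))
}

# Mimic the shape of the return value: a named list of
# SummarizedExperiment objects (names here are placeholders).
se_list <- list(
  resource_one = make_mock_se(4),
  resource_two = make_mock_se(6)
)

# Extract the single assay from every element as a plain matrix.
exprs_list <- lapply(se_list, function(se) assay(se, "exprs"))

# Rows are genes and columns are samples in each matrix.
vapply(exprs_list, ncol, integer(1))
```

Keeping the results in a list makes it straightforward to apply the same step to every resource with `lapply` or `purrr::map`.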

The `assay` of each `SummarizedExperiment` object is named `exprs` rather than `counts` because the data can come from either a microarray or a sequencing platform. If `colnames` begin with `GSE`, the data come from a microarray platform; if `colnames` begin with `SRR`, the data come from a sequencing platform.
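
The default of choosing the most recent creation date, mentioned at the start of this section, can be pictured with a small base R sketch. The resource names and the `date.series` layout are assumptions for illustration, and the code only imitates the selection rule; it does not reproduce the package internals.

```r
# Hypothetical resource names with two creation dates for one series
# (assumed layout: date.series).
resource_names <- c(
  "2021-06-01.GSE103147",
  "2021-09-15.GSE103147",
  "2021-09-15.GSE107991"
)

creation_date <- as.Date(substr(resource_names, 1, 10))
accession <- substring(resource_names, 12)

# For each accession, keep the name with the latest creation date.
latest <- vapply(
  split(seq_along(resource_names), accession),
  function(idx) resource_names[idx][which.max(as.numeric(creation_date[idx]))],
  character(1)
)

# Including the date in the search string pins an earlier version instead.
pinned <- grep("2021-06-01.GSE103147", resource_names, fixed = TRUE, value = TRUE)

print(latest)
print(pinned)
```

Here `latest` holds one name per accession, while `pinned` shows how a date in the search string narrows the match to a single dated resource.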

## No Metadata?

The `SummarizedExperiment` objects do not have sample metadata as `colData`, and this limits their use to unsupervised analyses for the time being. Sample metadata are currently undergoing manual curation, with the same level of diligence that was applied in data processing, and will be included in the package when they are ready.
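
As one example of an unsupervised analysis that needs no sample metadata, the sketch below runs a principal component analysis on a mock expression matrix with base R. The matrix, its dimensions, and the artificial shift between two groups of samples are all assumptions for illustration; with real data, the matrix would come from the `exprs` assay of a returned object.

```r
set.seed(1)

# Mock expression matrix (hypothetical values): 50 genes by 12 samples,
# with the last 6 samples shifted upward for the first 10 genes.
exprs <- matrix(
  rnorm(50 * 12, mean = 8, sd = 1),
  nrow = 50,
  dimnames = list(paste0("gene", 1:50), paste0("sample", 1:12))
)
exprs[1:10, 7:12] <- exprs[1:10, 7:12] + 3

# prcomp expects observations in rows, so transpose to samples by genes.
pca <- prcomp(t(exprs), center = TRUE, scale. = FALSE)

# Sample coordinates on the first two principal components.
scores <- pca$x[, 1:2]

# Proportion of variance explained by the first two components.
variance_explained <- (pca$sdev^2 / sum(pca$sdev^2))[1:2]

print(head(scores))
print(variance_explained)
```

Plotting `scores` gives a quick view of sample structure without any labels, which is the kind of analysis the objects currently support.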

## Contributing

To contribute to the `r BiocStyle::Biocpkg("tuberculosis")` R/Bioconductor package, first read the [contributing guidelines](CONTRIBUTING.md) and then open an issue. Also note that in contributing you agree to abide by the [code of conduct](CODE_OF_CONDUCT.md).

Owner

Name: (WIP DEV) Bioconductor Packages
Login: bioconductor-source
Kind: organization
Email: maintainer@bioconductor.org

Website: https://bioconductor.org
Repositories: 1
Profile: https://github.com/bioconductor-source

Source code for packages accepted into Bioconductor

Dependencies

DESCRIPTION cran

R >= 4.1.0 depends
SummarizedExperiment * depends
AnnotationHub * imports
ExperimentHub * imports
S4Vectors * imports
dplyr * imports
magrittr * imports
purrr * imports
rlang * imports
stringr * imports
tibble * imports
tidyr * imports
BiocStyle * suggests
ggplot2 * suggests
hrbrthemes * suggests
knitr * suggests
readr * suggests
rmarkdown * suggests
scater * suggests
usethis * suggests
utils * suggests
