partition
partition: A fast and flexible framework for data reduction in R - Published in JOSS (2020)
Science Score: 98.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file
- ✓ .zenodo.json file
- ✓ DOI references: 8 DOI reference(s) found in README and JOSS metadata
- ✓ Academic publication links: joss.theoj.org, zenodo.org
- ✓ Committers with academic emails: 1 of 4 committers (25.0%) from academic institutions
- ✓ Institutional organization owner: uscbiostats has institutional domain (biostatsepi.usc.edu)
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Keywords
data-reduction
dimensionality-reduction
partitional-clustering
r
Last synced: 6 months ago
Repository
A fast and flexible framework for data reduction in R
Basic Info
- Host: GitHub
- Owner: USCbiostats
- License: other
- Language: HTML
- Default Branch: master
- Homepage: https://uscbiostats.github.io/partition/
- Size: 15.1 MB
Statistics
- Stars: 37
- Watchers: 3
- Forks: 4
- Open Issues: 2
- Releases: 9
Topics
data-reduction
dimensionality-reduction
partitional-clustering
r
Created almost 7 years ago · Last pushed over 1 year ago
Metadata Files
Readme
Changelog
Contributing
License
Code of conduct
README.Rmd
---
output: github_document
references:
- id: R-partition
type: article-journal
author:
- family: Millstein
given: Joshua
- family: Battaglin
given: Francesca
- family: Barrett
given: Malcolm
- family: Cao
given: Shu
- family: Zhang
given: Wu
- family: Stintzing
given: Sebastian
- family: Heinemann
given: Volker
- family: Lenz
given: Heinz-Josef
issued:
- year: 2020
title: 'Partition: A surjective mapping approach for dimensionality reduction'
title-short: Partition
container-title: Bioinformatics
page: 676-681
volume: '36'
issue: '3'
URL: 'https://doi.org/10.1093/bioinformatics/btz661'
params:
invalidate_cache: false
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
dpi = 320
)
```
[R-CMD-check](https://github.com/USCbiostats/partition/actions/workflows/R-CMD-check.yaml)
[Codecov](https://app.codecov.io/github/USCbiostats/partition?branch=master)
[CRAN](https://cran.r-project.org/package=partition)
[JOSS](https://doi.org/10.21105/joss.01991)
[Zenodo](https://zenodo.org/badge/latestdoi/178615892)
[USC IMAGE](https://image.usc.edu)
# partition
partition is a fast and flexible framework for agglomerative partitioning. partition uses an approach called Direct-Measure-Reduce to create new variables that maintain the user-specified minimum level of information. Each reduced variable is also interpretable: the original variables map to one and only one variable in the reduced data set. partition is flexible, as well: how variables are selected for reduction, how information loss is measured, and the way data is reduced can all be customized.
## Installation
You can install partition from CRAN with:
``` r
install.packages("partition")
```
Or you can install the development version of partition from GitHub with:
``` r
# install.packages("remotes")
remotes::install_github("USCbiostats/partition")
```
## Example
```{r example}
library(partition)
set.seed(1234)
df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 100)
# don't accept reductions where information < .6
prt <- partition(df, threshold = .6)
prt
# return reduced data
partition_scores(prt)
# access mapping keys
mapping_key(prt)
unnest_mappings(prt)
# use a lower information threshold
partition(df, threshold = .5, partitioner = part_kmeans())
# use a custom partitioner
part_icc_rowmeans <- replace_partitioner(
part_icc,
reduce = as_reducer(rowMeans)
)
partition(df, threshold = .6, partitioner = part_icc_rowmeans)
```
partition also supports a number of ways to visualize partitions and permutation tests; these functions all start with `plot_*()`. They return ggplots and can thus be extended using ggplot2.
```{r stacked_area_chart, dpi = 320}
plot_stacked_area_clusters(df) +
ggplot2::theme_minimal(14)
```
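The permutation tests mentioned above can be visualized the same way. A minimal sketch, assuming `test_permutation()` (with an `nperm` argument) and `plot_permutation()` are available as in recent versions of partition:

```{r permutation_plot, eval = FALSE}
# compare the observed partition against partitions of permuted data,
# then plot the resulting null distribution
perms <- test_permutation(df, nperm = 10)
plot_permutation(perms) +
  ggplot2::theme_minimal(14)
```

Because these are ordinary ggplots, any ggplot2 layer or theme can be added with `+`, as above.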
## Performance
partition has been meticulously benchmarked and profiled to improve performance, and key sections are written in C++ or use C++-based packages. Using a data frame with 1 million rows on a 2017 MacBook Pro with 16 GB RAM, here's how each of the built-in partitioners performs:
```{r benchmarks1, eval = FALSE}
large_df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 1e6)
basic_benchmarks <- microbenchmark::microbenchmark(
icc = partition(large_df, .3),
kmeans = partition(large_df, .3, partitioner = part_kmeans()),
minr2 = partition(large_df, .3, partitioner = part_minr2()),
pc1 = partition(large_df, .3, partitioner = part_pc1()),
stdmi = partition(large_df, .3, partitioner = part_stdmi())
)
```
```{r secret_benchmarks1, echo = FALSE, warning=FALSE, message=FALSE}
library(microbenchmark)
library(ggplot2)
if (params$invalidate_cache) {
large_df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 1e6)
basic_benchmarks <- microbenchmark::microbenchmark(
icc = partition(large_df, .3),
kmeans = partition(large_df, .3, partitioner = part_kmeans()),
minr2 = partition(large_df, .3, partitioner = part_minr2()),
pc1 = partition(large_df, .3, partitioner = part_pc1()),
stdmi = partition(large_df, .3, partitioner = part_stdmi())
)
readr::write_rds(basic_benchmarks, "basic_benchmarks.rds")
} else {
basic_benchmarks <- readr::read_rds("basic_benchmarks.rds")
}
basic_benchmarks$expr <- forcats::fct_reorder(basic_benchmarks$expr, basic_benchmarks$time)
ggplot2::autoplot(basic_benchmarks) %+%
ggplot2::stat_ydensity(color = "#0072B2", fill = "#0072B2BF") +
ggplot2::theme_minimal()
```
## ICC vs K-Means
As the features (columns) in the data set become greater than the number of observations (rows), the default ICC method scales more linearly than K-Means-based methods. While K-Means is often faster at lower dimensions, it becomes slower as the features outnumber the observations. For example, using three data sets with increasing numbers of columns, K-Means starts as the fastest and gets increasingly slower, although in this case it is still comparable to ICC:
```{r benchmarks2, eval = FALSE}
narrow_df <- simulate_block_data(3:5, lower_corr = .4, upper_corr = .6, n = 100)
wide_df <- simulate_block_data(rep(3:10, 2), lower_corr = .4, upper_corr = .6, n = 100)
wider_df <- simulate_block_data(rep(3:20, 4), lower_corr = .4, upper_corr = .6, n = 100)
icc_kmeans_benchmarks <- microbenchmark::microbenchmark(
icc_narrow = partition(narrow_df, .3),
icc_wide = partition(wide_df, .3),
icc_wider = partition(wider_df, .3),
kmeans_narrow = partition(narrow_df, .3, partitioner = part_kmeans()),
kmeans_wide = partition(wide_df, .3, partitioner = part_kmeans()),
kmeans_wider = partition(wider_df, .3, partitioner = part_kmeans())
)
```
```{r secret_benchmarks2, echo = FALSE, warning=FALSE, message=FALSE}
if (params$invalidate_cache) {
narrow_df <- simulate_block_data(3:5, lower_corr = .4, upper_corr = .6, n = 100)
wide_df <- simulate_block_data(rep(3:10, 2), lower_corr = .4, upper_corr = .6, n = 100)
wider_df <- simulate_block_data(rep(3:20, 4), lower_corr = .4, upper_corr = .6, n = 100)
icc_kmeans_benchmarks <- microbenchmark::microbenchmark(
icc_narrow = partition(narrow_df, .3),
icc_wide = partition(wide_df, .3),
icc_wider = partition(wider_df, .3),
kmeans_narrow = partition(narrow_df, .3, partitioner = part_kmeans()),
kmeans_wide = partition(wide_df, .3, partitioner = part_kmeans()),
kmeans_wider = partition(wider_df, .3, partitioner = part_kmeans())
)
readr::write_rds(icc_kmeans_benchmarks, "icc_kmeans_benchmarks.rds")
} else {
icc_kmeans_benchmarks <- readr::read_rds("icc_kmeans_benchmarks.rds")
}
icc_kmeans_benchmarks$type <- stringr::str_extract(icc_kmeans_benchmarks$expr, "icc|kmeans")
ggplot2::autoplot(icc_kmeans_benchmarks) %+%
ggplot2::stat_ydensity(color = "#0072B2", fill = "#0072B2BF") +
ggplot2::facet_wrap(~type, ncol = 1, scales = "free_y") +
ggplot2::theme_minimal()
```
For more information, see [our paper in Bioinformatics](https://doi.org/10.1093/bioinformatics/btz661), which discusses these issues in more depth [@R-partition].
## Contributing
Please read the [Contributor Guidelines](https://github.com/USCbiostats/partition/blob/master/.github/CONTRIBUTING.md) prior to submitting a pull request to partition. Also note that this project is released with a [Contributor Code of Conduct](https://github.com/USCbiostats/partition/blob/master/.github/CODE_OF_CONDUCT.md). By participating in this project you agree to abide by its terms.
## References
Owner
- Name: USC Division of Biostatistics
- Login: USCbiostats
- Kind: organization
- Location: Los Angeles, CA
- Website: https://biostatsepi.usc.edu/
- Repositories: 34
- Profile: https://github.com/USCbiostats
JOSS Publication
partition: A fast and flexible framework for data reduction in R
Published
March 18, 2020
Volume 5, Issue 47, Page 1991
Authors
Joshua Millstein
Department of Preventive Medicine, University of Southern California
Papers & Mentions
Total mentions: 2
Evolutionary analysis reveals regulatory and functional landscape of coding and non-coding RNA editing
- DOI: 10.1371/journal.pgen.1006563
- OpenAlex ID: https://openalex.org/W2586586494
- Published: February 2017
Last synced: 4 months ago
Identification and Analysis of Co-Occurrence Networks with NetCutter
- DOI: 10.1371/journal.pone.0003178
- OpenAlex ID: https://openalex.org/W2006915645
- Published: September 2008
Last synced: 4 months ago
GitHub Events
Total
- Issues event: 1
- Watch event: 3
- Issue comment event: 4
- Push event: 1
- Pull request review event: 3
- Pull request review comment event: 2
- Pull request event: 2
Last Year
- Issues event: 1
- Watch event: 3
- Issue comment event: 4
- Push event: 1
- Pull request review event: 3
- Pull request review comment event: 2
- Pull request event: 2
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Malcolm Barrett | m****t@g****m | 207 |
| katelynqueen98 | k****n@u****u | 27 |
| George G. Vega Yon | g****n@g****m | 2 |
| Arfon Smith | a****n | 1 |
Committer Domains (Top 20 + Academic)
usc.edu: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 21
- Total pull requests: 23
- Average time to close issues: 30 days
- Average time to close pull requests: 2 days
- Total issue authors: 5
- Total pull request authors: 4
- Average comments per issue: 1.29
- Average comments per pull request: 0.83
- Merged pull requests: 23
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 2
- Average time to close issues: about 2 hours
- Average time to close pull requests: 4 days
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 2.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- malcolmbarrett (14)
- millstei (3)
- clauswilke (2)
- krlmlr (1)
- jtian123 (1)
Pull Request Authors
- malcolmbarrett (21)
- katelynqueen98 (5)
- arfon (1)
- gvegayon (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 323 last month (CRAN)
- Total dependent packages: 1
- Total dependent repositories: 1
- Total versions: 8
- Total maintainers: 1
cran.r-project.org: partition
Agglomerative Partitioning Framework for Dimension Reduction
- Homepage: https://uscbiostats.github.io/partition/
- Documentation: http://cran.r-project.org/web/packages/partition/partition.pdf
- License: MIT + file LICENSE
- Latest release: 0.2.2 (published over 1 year ago)
Rankings
Stargazers count: 8.6%
Forks count: 12.2%
Dependent packages count: 18.1%
Average: 19.5%
Dependent repos count: 24.0%
Downloads: 34.7%
Maintainers (1)
Last synced: 6 months ago
Dependencies
DESCRIPTION
cran
- R >= 3.3.0 depends
- MASS * imports
- Rcpp * imports
- crayon * imports
- dplyr >= 0.8.0 imports
- forcats * imports
- ggplot2 >= 3.3.0 imports
- infotheo * imports
- magrittr * imports
- pillar * imports
- purrr * imports
- rlang * imports
- stringr * imports
- tibble * imports
- tidyr >= 1.0.0 imports
- covr * suggests
- ggcorrplot * suggests
- knitr * suggests
- rmarkdown * suggests
- spelling * suggests
- testthat >= 3.0.0 suggests
.github/workflows/R-CMD-check.yaml
actions
- actions/checkout v2 composite
- actions/upload-artifact main composite
- r-lib/actions/check-r-package v1 composite
- r-lib/actions/setup-pandoc v1 composite
- r-lib/actions/setup-r v1 composite
- r-lib/actions/setup-r-dependencies v1 composite
.github/workflows/pkgdown.yaml
actions
- actions/checkout master composite
- r-lib/actions/setup-pandoc master composite
- r-lib/actions/setup-r master composite
