partition

partition: A fast and flexible framework for data reduction in R - Published in JOSS (2020)

https://github.com/uscbiostats/partition

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org, zenodo.org
  • Committers with academic emails
    1 of 4 committers (25.0%) from academic institutions
  • Institutional organization owner
    Organization uscbiostats has institutional domain (biostatsepi.usc.edu)
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

data-reduction dimensionality-reduction partitional-clustering r
Last synced: 6 months ago

Repository

A fast and flexible framework for data reduction in R

Basic Info
Statistics
  • Stars: 37
  • Watchers: 3
  • Forks: 4
  • Open Issues: 2
  • Releases: 9
Topics
data-reduction dimensionality-reduction partitional-clustering r
Created almost 7 years ago · Last pushed over 1 year ago
Metadata Files
Readme · Changelog · Contributing · License · Code of conduct

README.Rmd

---
output: github_document
references:
- id: R-partition
  type: article-journal
  author:
  - family: Millstein
    given: Joshua
  - family: Battaglin
    given: Francesca
  - family: Barrett
    given: Malcolm
  - family: Cao
    given: Shu
  - family: Zhang
    given: Wu
  - family: Stintzing
    given: Sebastian
  - family: Heinemann
    given: Volker
  - family: Lenz
    given: Heinz-Josef
  issued:
  - year: 2020
  title: 'Partition: A surjective mapping approach for dimensionality reduction'
  title-short: Partition
  container-title: Bioinformatics
  page: 676-681
  volume: '36'
  issue: '3'
  URL: 'https://doi.org/10.1093/bioinformatics/btz661'
params:
  invalidate_cache: false
---



```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  dpi = 320
)
```


[![R-CMD-check](https://github.com/USCbiostats/partition/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/USCbiostats/partition/actions/workflows/R-CMD-check.yaml)
[![Coverage status](https://codecov.io/gh/USCbiostats/partition/branch/master/graph/badge.svg)](https://app.codecov.io/github/USCbiostats/partition?branch=master)
[![CRAN status](https://www.r-pkg.org/badges/version-ago/partition)](https://cran.r-project.org/package=partition)
[![JOSS](https://joss.theoj.org/papers/10.21105/joss.01991/status.svg)](https://doi.org/10.21105/joss.01991)
[![DOI](https://zenodo.org/badge/178615892.svg)](https://zenodo.org/badge/latestdoi/178615892)
[![USC IMAGE](https://raw.githubusercontent.com/USCbiostats/badges/master/tommy-image-badge.svg)](https://image.usc.edu)

 
# partition

partition is a fast and flexible framework for agglomerative partitioning. partition uses an approach called Direct-Measure-Reduce to create new variables that maintain the user-specified minimum level of information. Each reduced variable is also interpretable: the original variables map to one and only one variable in the reduced data set. partition is flexible, as well: how variables are selected for reduction, how information loss is measured, and the way data are reduced can all be customized.
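For instance, a partitioner can be assembled from its three steps directly. A minimal sketch, assuming an `as_partitioner()` constructor and step functions named `direct_distance_pearson`, `measure_icc`, and `reduce_scaled_mean` (check the package reference for the exact exported names and signatures):

``` r
library(partition)

# assemble a Direct-Measure-Reduce partitioner from its three steps;
# the step function names here are assumptions -- see the package reference
custom_part <- as_partitioner(
  direct  = direct_distance_pearson,  # direct: find the most similar variables
  measure = measure_icc,              # measure: score retained information with ICC
  reduce  = reduce_scaled_mean        # reduce: collapse each block to a scaled mean
)

df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 100)
partition(df, threshold = .6, partitioner = custom_part)
```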

## Installation

You can install partition from CRAN with:

``` r
install.packages("partition")
```

Or you can install the development version of partition from GitHub with:

``` r
# install.packages("remotes")
remotes::install_github("USCbiostats/partition")
```

## Example

```{r example}
library(partition)
set.seed(1234)
df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 100)

# don't accept reductions where information < .6
prt <- partition(df, threshold = .6)
prt

# return reduced data
partition_scores(prt)

# access mapping keys
mapping_key(prt)

unnest_mappings(prt)

# use a lower information threshold (allow more information loss)
partition(df, threshold = .5, partitioner = part_kmeans())

# use a custom partitioner
part_icc_rowmeans <- replace_partitioner(
  part_icc, 
  reduce = as_reducer(rowMeans)
)
partition(df, threshold = .6, partitioner = part_icc_rowmeans) 
```
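Because each reduced variable maps back to a unique set of source variables, the output of `partition_scores()` can drop straight into downstream models. A minimal sketch; the outcome `y` below is simulated purely for illustration:

``` r
library(partition)
set.seed(1234)
df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 100)
prt <- partition(df, threshold = .6)

reduced <- partition_scores(prt)  # one column per reduced variable
y <- rnorm(nrow(reduced))         # simulated outcome, for illustration only

# fit a model on the reduced features instead of the raw ones
fit <- lm(y ~ ., data = reduced)
summary(fit)
```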

partition also supports a number of ways to visualize partitions and permutation tests; these functions all start with `plot_*()`. Each returns a ggplot object, so they can be extended with ggplot2.

```{r stacked_area_chart, dpi = 320}
plot_stacked_area_clusters(df) +
  ggplot2::theme_minimal(14)
```
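The permutation-test side can be sketched similarly. This assumes helper names `test_permutation()` and `plot_permutation()` (check the package reference index for the exact functions and arguments):

``` r
# a sketch: compare observed partitions against permuted data;
# test_permutation() and plot_permutation() are assumed names
perms <- test_permutation(df, nperm = 10)
plot_permutation(perms) +
  ggplot2::theme_minimal(14)
```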

## Performance

partition has been meticulously benchmarked and profiled to improve performance, and key sections are written in C++ or use C++-based packages. Using a data frame with 1 million rows on a 2017 MacBook Pro with 16 GB of RAM, here's how each of the built-in partitioners performs:

```{r benchmarks1, eval = FALSE}
large_df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 1e6)

basic_benchmarks <- microbenchmark::microbenchmark(
  icc = partition(large_df, .3),
  kmeans = partition(large_df, .3, partitioner = part_kmeans()),
  minr2 = partition(large_df, .3, partitioner = part_minr2()),
  pc1 = partition(large_df, .3, partitioner = part_pc1()),
  stdmi = partition(large_df, .3, partitioner = part_stdmi())
)
```

```{r secret_benchmarks1, echo = FALSE, warning=FALSE, message=FALSE}
library(microbenchmark)
library(ggplot2)
if (params$invalidate_cache) {
  large_df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 1e6)
  
  basic_benchmarks <- microbenchmark::microbenchmark(
    icc = partition(large_df, .3),
    kmeans = partition(large_df, .3, partitioner = part_kmeans()),
    minr2 = partition(large_df, .3, partitioner = part_minr2()),
    pc1 = partition(large_df, .3, partitioner = part_pc1()),
    stdmi = partition(large_df, .3, partitioner = part_stdmi())
  )
  
  readr::write_rds(basic_benchmarks, "basic_benchmarks.rds")
} else {
  basic_benchmarks <- readr::read_rds("basic_benchmarks.rds")
}

basic_benchmarks$expr <- forcats::fct_reorder(basic_benchmarks$expr, basic_benchmarks$time)
ggplot2::autoplot(basic_benchmarks) %+% 
  ggplot2::stat_ydensity(color = "#0072B2", fill = "#0072B2BF") +
  ggplot2::theme_minimal()
```

## ICC vs K-Means

As the number of features (columns) in the data set grows beyond the number of observations (rows), the default ICC method scales more linearly than K-Means-based methods. While K-Means is often faster at lower dimensions, it becomes slower as the features outnumber the observations. For example, using three data sets with increasing numbers of columns, K-Means starts as the fastest and becomes increasingly slower, although in this case it is still comparable to ICC:

```{r benchmarks2, eval = FALSE}
narrow_df <- simulate_block_data(3:5, lower_corr = .4, upper_corr = .6, n = 100)
wide_df <- simulate_block_data(rep(3:10, 2), lower_corr = .4, upper_corr = .6, n = 100)
wider_df <- simulate_block_data(rep(3:20, 4), lower_corr = .4, upper_corr = .6, n = 100)

icc_kmeans_benchmarks <- microbenchmark::microbenchmark(
  icc_narrow = partition(narrow_df, .3),
  icc_wide = partition(wide_df, .3),
  icc_wider = partition(wider_df, .3),
  kmeans_narrow = partition(narrow_df, .3, partitioner = part_kmeans()),
  kmeans_wide = partition(wide_df, .3, partitioner = part_kmeans()),
  kmeans_wider  = partition(wider_df, .3, partitioner = part_kmeans())
)
```

```{r secret_benchmarks2, echo = FALSE, warning=FALSE, message=FALSE}
if (params$invalidate_cache) {
  narrow_df <- simulate_block_data(3:5, lower_corr = .4, upper_corr = .6, n = 100)
  wide_df <- simulate_block_data(rep(3:10, 2), lower_corr = .4, upper_corr = .6, n = 100)
  wider_df <- simulate_block_data(rep(3:20, 4), lower_corr = .4, upper_corr = .6, n = 100)
  
  icc_kmeans_benchmarks <- microbenchmark::microbenchmark(
    icc_narrow = partition(narrow_df, .3),
    icc_wide = partition(wide_df, .3),
    icc_wider = partition(wider_df, .3),
    kmeans_narrow = partition(narrow_df, .3, partitioner = part_kmeans()),
    kmeans_wide = partition(wide_df, .3, partitioner = part_kmeans()),
    kmeans_wider  = partition(wider_df, .3, partitioner = part_kmeans())
  )
  
  readr::write_rds(icc_kmeans_benchmarks, "icc_kmeans_benchmarks.rds")
} else {
  icc_kmeans_benchmarks <- readr::read_rds("icc_kmeans_benchmarks.rds")
}

icc_kmeans_benchmarks$type <- stringr::str_extract(icc_kmeans_benchmarks$expr, "icc|kmeans")

ggplot2::autoplot(icc_kmeans_benchmarks) %+% 
  ggplot2::stat_ydensity(color = "#0072B2", fill = "#0072B2BF") +
  ggplot2::facet_wrap(~type, ncol = 1, scales = "free_y") + 
  ggplot2::theme_minimal()
```

For more information, see [our paper in Bioinformatics](https://doi.org/10.1093/bioinformatics/btz661), which discusses these issues in more depth [@R-partition].

## Contributing 

Please read the [Contributor Guidelines](https://github.com/USCbiostats/partition/blob/master/.github/CONTRIBUTING.md) prior to submitting a pull request to partition. Also note that this project is released with a [Contributor Code of Conduct](https://github.com/USCbiostats/partition/blob/master/.github/CODE_OF_CONDUCT.md). By participating in this project you agree to abide by its terms.

## References

Owner

  • Name: USC Division of Biostatistics
  • Login: USCbiostats
  • Kind: organization
  • Location: Los Angeles, CA

JOSS Publication

partition: A fast and flexible framework for data reduction in R
Published
March 18, 2020
Volume 5, Issue 47, Page 1991
Authors
Malcolm Barrett ORCID
Department of Preventive Medicine, University of Southern California
Joshua Millstein
Department of Preventive Medicine, University of Southern California
Editor
Charlotte Soneson ORCID

Papers & Mentions

Total mentions: 2

Evolutionary analysis reveals regulatory and functional landscape of coding and non-coding RNA editing
Last synced: 4 months ago
Identification and Analysis of Co-Occurrence Networks with NetCutter
Last synced: 4 months ago

GitHub Events

Total
  • Issues event: 1
  • Watch event: 3
  • Issue comment event: 4
  • Push event: 1
  • Pull request review event: 3
  • Pull request review comment event: 2
  • Pull request event: 2
Last Year
  • Issues event: 1
  • Watch event: 3
  • Issue comment event: 4
  • Push event: 1
  • Pull request review event: 3
  • Pull request review comment event: 2
  • Pull request event: 2

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 237
  • Total Committers: 4
  • Avg Commits per committer: 59.25
  • Development Distribution Score (DDS): 0.127
Past Year
  • Commits: 12
  • Committers: 2
  • Avg Commits per committer: 6.0
  • Development Distribution Score (DDS): 0.25
Top Committers
Name Email Commits
Malcolm Barrett m****t@g****m 207
katelynqueen98 k****n@u****u 27
George G. Vega Yon g****n@g****m 2
Arfon Smith a****n 1
Committer Domains (Top 20 + Academic)
usc.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 21
  • Total pull requests: 23
  • Average time to close issues: 30 days
  • Average time to close pull requests: 2 days
  • Total issue authors: 5
  • Total pull request authors: 4
  • Average comments per issue: 1.29
  • Average comments per pull request: 0.83
  • Merged pull requests: 23
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 2
  • Average time to close issues: about 2 hours
  • Average time to close pull requests: 4 days
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • malcolmbarrett (14)
  • millstei (3)
  • clauswilke (2)
  • krlmlr (1)
  • jtian123 (1)
Pull Request Authors
  • malcolmbarrett (21)
  • katelynqueen98 (5)
  • arfon (1)
  • gvegayon (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran: 323 last month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 8
  • Total maintainers: 1
cran.r-project.org: partition

Agglomerative Partitioning Framework for Dimension Reduction

  • Versions: 8
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 323 Last month
Rankings
Stargazers count: 8.6%
Forks count: 12.2%
Dependent packages count: 18.1%
Average: 19.5%
Dependent repos count: 24.0%
Downloads: 34.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.3.0 depends
  • MASS * imports
  • Rcpp * imports
  • crayon * imports
  • dplyr >= 0.8.0 imports
  • forcats * imports
  • ggplot2 >= 3.3.0 imports
  • infotheo * imports
  • magrittr * imports
  • pillar * imports
  • purrr * imports
  • rlang * imports
  • stringr * imports
  • tibble * imports
  • tidyr >= 1.0.0 imports
  • covr * suggests
  • ggcorrplot * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • spelling * suggests
  • testthat >= 3.0.0 suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v2 composite
  • actions/upload-artifact main composite
  • r-lib/actions/check-r-package v1 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite
  • r-lib/actions/setup-r-dependencies v1 composite
.github/workflows/pkgdown.yaml actions
  • actions/checkout master composite
  • r-lib/actions/setup-pandoc master composite
  • r-lib/actions/setup-r master composite