bigstatsr

R package for statistical tools with big matrices stored on disk.

https://github.com/privefl/bigstatsr

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.6%) to scientific vocabulary

Keywords

big-data large-matrices memory-mapped-file parallel-computing r r-package statistical-methods

Keywords from Contributors

bioinformatics polygenic-scores population-structure-inference snp-data
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 180
  • Watchers: 4
  • Forks: 31
  • Open Issues: 8
  • Releases: 0
Created over 9 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog

README.md


bigstatsr

R package {bigstatsr} provides functions for fast statistical analysis of large-scale data encoded as matrices. It can handle matrices that are too large to fit in memory by memory-mapping them to binary files on disk. This format is very similar to the big.matrix format provided by R package {bigmemory}, which this package no longer uses (see the corresponding vignette). As input, the functions of {bigstatsr} take Filebacked Big Matrices (FBMs).

LIST OF FEATURES

Note that most of the algorithms of this package don't handle missing values.
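Since most algorithms assume complete data, it can be useful to check for missing values before running them. The following is a minimal sketch, assuming a small temporary FBM created only for illustration; the helper name `count_na` is hypothetical and not part of the package. It counts `NA`s block by block with `big_apply()`, so the whole matrix is never loaded into memory at once.

```r
library(bigstatsr)

# Small FBM backed by a temporary file, for illustration only
X <- FBM(10, 5, init = 0)
X[2, 3] <- NA  # introduce one missing value

# Hypothetical helper (not part of {bigstatsr}):
# count missing values by blocks of columns
count_na <- function(X) {
  counts <- big_apply(X, a.FUN = function(X, ind) {
    sum(is.na(X[, ind]))
  }, a.combine = 'c')
  sum(counts)
}

n_missing <- count_na(X)
if (n_missing > 0) message(n_missing, " missing value(s) found")
```

How to deal with any missing values found (imputation, filtering) depends on the analysis and is outside the scope of this sketch.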

Installation

```r
# For the CRAN version
install.packages("bigstatsr")

# For the latest version
remotes::install_github("privefl/bigstatsr")
```

Small example

```r
library(bigstatsr)

# Create the data on disk
X <- FBM(5e3, 10e3, backingfile = "test")$save()
# If you open a new session you can do
X <- big_attach("test.rds")

# Fill it by chunks with random values
U <- matrix(0, nrow(X), 5); U[] <- rnorm(length(U))
V <- matrix(0, ncol(X), 5); V[] <- rnorm(length(V))
NCORES <- nb_cores()
# X = U V^T + E
big_apply(X, a.FUN = function(X, ind, U, V) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL  ## you don't want to return anything here
}, a.combine = 'c', ncores = NCORES, U = U, V = V)
# Check some values
X[1:5, 1:5]

# Compute first 10 PCs
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(),
                         k = 10, ncores = NCORES)
plot(obj.svd)

# Cleanup
unlink(paste0("test", c(".bk", ".rds")))
```

Learn more with this introduction to package {bigstatsr}.

If you want to use Rcpp code, look at this tutorial.

Some use cases

Parallelization

Package {bigstatsr} uses package {foreach} for its parallelization tasks. Learn more on parallelism with {foreach} with this tutorial.
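As a minimal sketch of this kind of parallelism, column sums of an FBM can be computed by blocks of columns with {foreach}. The small temporary FBM, the 2 worker processes, and the {doParallel} backend are all assumptions made for illustration; the README does not prescribe a particular backend.

```r
library(bigstatsr)
library(foreach)

# Small FBM backed by a temporary file, filled with ones (illustration only)
X <- FBM(100, 20, init = 1)

# Register 2 worker processes (assumes {doParallel} is installed)
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

# Split the column indices into 2 blocks of 10 columns
blocks <- split(cols_along(X), rep(1:2, each = 10))

# Each worker reads only its own block of columns from the backing file
sums <- foreach(ind = blocks, .combine = 'c',
                .packages = "bigstatsr") %dopar% {
  colSums(X[, ind])
}

parallel::stopCluster(cl)
```

Only the small FBM object is sent to the workers, not the data itself, which each worker reads from the backing file on disk. For common cases, `big_apply()` with its `ncores` argument wraps this pattern.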

Large datasets

Bug report / Help

How to make a great R reproducible example?

Please open an issue if you find a bug.

If you want help using {bigstatsr}, please open an issue as well or post on Stack Overflow with the tag bigstatsr.

I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.

References

  • Privé, Florian, et al. "Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr." Bioinformatics 34.16 (2018): 2781-2787.

  • Privé, Florian, Hugues Aschard, and Michael GB Blum. "Efficient implementation of penalized regression for genetic risk prediction." Genetics 212.1 (2019): 65-74.


Owner

  • Name: Florian Privé
  • Login: privefl
  • Kind: user
  • Location: Aarhus, Denmark // Lyon, France
  • Company: National Center for Register-based Research (NCRR)

Senior Researcher (2022-) • Postdoc (2019-2021) • PhD student (2016-2019) in predictive human genetics • ENSIMAG (2013-2016)

GitHub Events

Total
  • Issues event: 10
  • Watch event: 3
  • Issue comment event: 9
  • Push event: 1
  • Fork event: 1
Last Year
  • Issues event: 10
  • Watch event: 3
  • Issue comment event: 9
  • Push event: 1
  • Fork event: 1

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 906
  • Total Committers: 6
  • Avg Commits per committer: 151.0
  • Development Distribution Score (DDS): 0.043
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Florian Privé | f****1@g****m | 867 |
| Florian Privé | f****e@i****g | 18 |
| privef | p****f@t****r | 16 |
| Florian Franck Privé | a****3@u****k | 3 |
| Katrin Leinweber | k****i@p****e | 1 |
| Jeroen | j****s@g****m | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 123
  • Total pull requests: 9
  • Average time to close issues: 4 months
  • Average time to close pull requests: 2 months
  • Total issue authors: 58
  • Total pull request authors: 4
  • Average comments per issue: 3.53
  • Average comments per pull request: 1.56
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 6
  • Pull requests: 1
  • Average time to close issues: 9 months
  • Average time to close pull requests: 3 days
  • Issue authors: 4
  • Pull request authors: 1
  • Average comments per issue: 0.17
  • Average comments per pull request: 2.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • privefl (39)
  • garyzhubc (6)
  • nbenn (5)
  • minhnd212 (4)
  • dramanica (3)
  • biona001 (3)
  • msesia (3)
  • annilk (3)
  • opain (2)
  • mkelcb (2)
  • dongleihu (2)
  • chrisraynerr (2)
  • leocob (2)
  • gaochengPRC (2)
  • mj-thompson (2)
Pull Request Authors
  • privefl (5)
  • dramanica (2)
  • Minta821 (1)
  • brieuclehmann (1)
Top Labels
Issue Labels
  • enhancement (17)
  • feature request (10)
  • help wanted (7)
  • question (7)
  • bug (6)
  • Good for first PR (4)
  • wont-do-unless (1)
  • improve documentation (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • cran 4,051 last-month
  • Total docker downloads: 45,783
  • Total dependent packages: 11
    (may contain duplicates)
  • Total dependent repositories: 30
    (may contain duplicates)
  • Total versions: 23
  • Total maintainers: 1
cran.r-project.org: bigstatsr

Statistical Tools for Filebacked Big Matrices

  • Versions: 19
  • Dependent Packages: 11
  • Dependent Repositories: 29
  • Downloads: 4,051 Last month
  • Docker Downloads: 45,783
Rankings
Stargazers count: 2.5%
Forks count: 3.0%
Dependent packages count: 5.0%
Dependent repos count: 5.0%
Average: 7.4%
Downloads: 7.8%
Docker downloads count: 21.1%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-bigstatsr
  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 1
Rankings
Dependent repos count: 24.4%
Stargazers count: 28.2%
Forks count: 32.4%
Average: 34.1%
Dependent packages count: 51.6%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.3 depends
  • RSpectra * imports
  • Rcpp * imports
  • bigassertr >= 0.1.1 imports
  • bigparallelr >= 0.2.3 imports
  • cowplot * imports
  • foreach * imports
  • ggplot2 >= 3.0 imports
  • graphics * imports
  • methods * imports
  • ps >= 1.4 imports
  • rmio >= 0.4 imports
  • stats * imports
  • tibble * imports
  • utils * imports
  • ModelMetrics * suggests
  • RhpcBLASctl * suggests
  • bigmemory >= 4.5.33 suggests
  • bigreadr >= 0.2 suggests
  • covr * suggests
  • data.table * suggests
  • dplyr * suggests
  • glmnet * suggests
  • hexbin * suggests
  • memuse * suggests
  • ppcor * suggests
  • spelling >= 1.2 suggests
  • testthat * suggests
.github/workflows/check-standard.yaml actions
  • actions/checkout v2 composite
  • actions/upload-artifact main composite
  • r-lib/actions/check-r-package v1 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite
  • r-lib/actions/setup-r-dependencies v1 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite