bigstatsr
R package for statistical tools with big matrices stored on disk.
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (16.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: privefl
- Language: R
- Default Branch: master
- Homepage: https://privefl.github.io/bigstatsr/
- Size: 38.3 MB
Statistics
- Stars: 180
- Watchers: 4
- Forks: 31
- Open Issues: 8
- Releases: 0
Metadata Files
README.md
bigstatsr

R package {bigstatsr} provides functions for fast statistical analysis of large-scale data encoded as matrices. The package can handle matrices that are too large to fit in memory thanks to memory-mapping to binary files on disk. This is very similar to the format big.matrix provided by R package {bigmemory}, which is no longer used by this package (see the corresponding vignette).
As inputs, package {bigstatsr} uses Filebacked Big Matrices (FBM).
Note that most of the algorithms of this package don't handle missing values.
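Since most algorithms assume complete data, it can be worth checking an FBM for missing values before running them. The following is a minimal sketch, not part of the original README: it builds a small toy FBM (the dimensions and the positions of the `NA`s are made up for illustration) and counts missing values per column with `big_apply()`.

```r
library(bigstatsr)

# Toy FBM backed by a temporary file, with two missing values (illustrative data)
X <- FBM(10, 5, init = 1)
X[2, 3] <- NA_real_
X[7, 3] <- NA_real_

# Count NAs per column, reading the matrix one block of columns at a time
na_counts <- big_apply(X, a.FUN = function(X, ind) {
  colSums(is.na(X[, ind, drop = FALSE]))
}, a.combine = 'c')

na_counts
```

Columns with a non-zero count would need to be imputed or excluded before calling functions that do not support missing values.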
Installation
```r
# For the CRAN version
install.packages("bigstatsr")
# For the latest version
remotes::install_github("privefl/bigstatsr")
```
Small example
```r
library(bigstatsr)

# Create the data on disk
X <- FBM(5e3, 10e3, backingfile = "test")$save()
# If you open a new session you can do
X <- big_attach("test.rds")

# Fill it by chunks with random values
U <- matrix(0, nrow(X), 5); U[] <- rnorm(length(U))
V <- matrix(0, ncol(X), 5); V[] <- rnorm(length(V))
NCORES <- nb_cores()
# X = U V^T + E
big_apply(X, a.FUN = function(X, ind, U, V) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL  ## you don't want to return anything here
}, a.combine = 'c', ncores = NCORES, U = U, V = V)
# Check some values
X[1:5, 1:5]

# Compute first 10 PCs
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(), k = 10, ncores = NCORES)
plot(obj.svd)

# Cleanup
unlink(paste0("test", c(".bk", ".rds")))
```
Learn more with this introduction to package {bigstatsr}.
If you want to use Rcpp code, look at this tutorial.
Some use cases
Parallelization
Package {bigstatsr} uses package {foreach} for its parallelization tasks. Learn more on parallelism with {foreach} with this tutorial.
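For readers new to {foreach}, here is a minimal sketch of the general pattern. It is an illustration only and does not reproduce how {bigstatsr} sets up its workers internally; it assumes that package {doParallel} is installed to register the backend, and the cluster size of 2 is arbitrary.

```r
library(foreach)

# Start a small cluster and register it as the foreach backend
# (doParallel is an assumption here; other backends work similarly)
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

# Each iteration runs on a worker; results are combined with c()
squares <- foreach(i = 1:4, .combine = 'c') %dopar% {
  i^2
}

parallel::stopCluster(cl)
squares
```

Replacing `%dopar%` with `%do%` runs the same loop sequentially, which is handy for debugging.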
Large datasets
Computing the null space of a big matrix (works if one dimension is not too large)
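One possible way to approach this, sketched under the assumption that the number of columns is small, is to compute the small cross-product matrix by blocks and take the eigenvectors with (numerically) zero eigenvalues. This is not necessarily the method used in the linked post; the toy matrix and the tolerance below are illustrative choices.

```r
library(bigstatsr)

set.seed(1)
# Toy rank-deficient matrix: 100 x 5 with only 3 independent columns
A <- matrix(rnorm(300), 100, 3)
M <- cbind(A, A[, 1] + A[, 2], A[, 1] - A[, 3])
X <- as_FBM(M)

# Small 5 x 5 cross-product X^T X, computed by blocks (no centering or scaling)
K <- big_crossprodSelf(X, fun.scaling = big_scale(center = FALSE, scale = FALSE))

# Eigenvectors whose eigenvalues are numerically zero span the null space
eig <- eigen(K[], symmetric = TRUE)
tol <- max(eig$values) * 1e-8  # illustrative relative tolerance
N <- eig$vectors[, eig$values < tol, drop = FALSE]

# Check: X %*% N should be close to zero
residual <- max(abs(big_prodMat(X, N)))
residual
```

Only the small cross-product is held in memory, which is why the approach requires one dimension to be moderate.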
Bug report / Help
How to make a great R reproducible example?
Please open an issue if you find a bug.
If you want help using {bigstatsr}, please open an issue as well or post on Stack Overflow with the tag bigstatsr.
I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.
References
Privé, Florian, et al. "Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr." Bioinformatics 34.16 (2018): 2781-2787.
Privé, Florian, Hugues Aschard, and Michael GB Blum. "Efficient implementation of penalized regression for genetic risk prediction." Genetics 212.1 (2019): 65-74.
Owner
- Name: Florian Privé
- Login: privefl
- Kind: user
- Location: Aarhus, Denmark // Lyon, France
- Company: National Center for Register-based Research (NCRR)
- Website: https://privefl.github.io/
- Twitter: privefl
- Repositories: 104
- Profile: https://github.com/privefl
Senior Researcher (2022-) • Postdoc (2019-2021) • PhD student (2016-2019) in predictive human genetics • ENSIMAG (2013-2016)
GitHub Events
Total
- Issues event: 10
- Watch event: 3
- Issue comment event: 9
- Push event: 1
- Fork event: 1
Last Year
- Issues event: 10
- Watch event: 3
- Issue comment event: 9
- Push event: 1
- Fork event: 1
Committers
Last synced: about 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Florian Privé | f****1@g****m | 867 |
| Florian Privé | f****e@i****g | 18 |
| privef | p****f@t****r | 16 |
| Florian Franck Privé | a****3@u****k | 3 |
| Katrin Leinweber | k****i@p****e | 1 |
| Jeroen | j****s@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 123
- Total pull requests: 9
- Average time to close issues: 4 months
- Average time to close pull requests: 2 months
- Total issue authors: 58
- Total pull request authors: 4
- Average comments per issue: 3.53
- Average comments per pull request: 1.56
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 6
- Pull requests: 1
- Average time to close issues: 9 months
- Average time to close pull requests: 3 days
- Issue authors: 4
- Pull request authors: 1
- Average comments per issue: 0.17
- Average comments per pull request: 2.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- privefl (39)
- garyzhubc (6)
- nbenn (5)
- minhnd212 (4)
- dramanica (3)
- biona001 (3)
- msesia (3)
- annilk (3)
- opain (2)
- mkelcb (2)
- dongleihu (2)
- chrisraynerr (2)
- leocob (2)
- gaochengPRC (2)
- mj-thompson (2)
Pull Request Authors
- privefl (5)
- dramanica (2)
- Minta821 (1)
- brieuclehmann (1)
Packages
- Total packages: 2
- Total downloads:
  - cran: 4,051 last-month
- Total docker downloads: 45,783
- Total dependent packages: 11 (may contain duplicates)
- Total dependent repositories: 30 (may contain duplicates)
- Total versions: 23
- Total maintainers: 1
cran.r-project.org: bigstatsr
Statistical Tools for Filebacked Big Matrices
- Homepage: https://privefl.github.io/bigstatsr/
- Documentation: http://cran.r-project.org/web/packages/bigstatsr/bigstatsr.pdf
- License: GPL-3
- Latest release: 1.6.2 (published 7 months ago)
conda-forge.org: r-bigstatsr
- Homepage: https://privefl.github.io/bigstatsr/
- License: GPL-3.0-only
- Latest release: 1.5.12 (published over 3 years ago)
Dependencies
- R >= 3.3 depends
- RSpectra * imports
- Rcpp * imports
- bigassertr >= 0.1.1 imports
- bigparallelr >= 0.2.3 imports
- cowplot * imports
- foreach * imports
- ggplot2 >= 3.0 imports
- graphics * imports
- methods * imports
- ps >= 1.4 imports
- rmio >= 0.4 imports
- stats * imports
- tibble * imports
- utils * imports
- ModelMetrics * suggests
- RhpcBLASctl * suggests
- bigmemory >= 4.5.33 suggests
- bigreadr >= 0.2 suggests
- covr * suggests
- data.table * suggests
- dplyr * suggests
- glmnet * suggests
- hexbin * suggests
- memuse * suggests
- ppcor * suggests
- spelling >= 1.2 suggests
- testthat * suggests
- actions/checkout v2 composite
- actions/upload-artifact main composite
- r-lib/actions/check-r-package v1 composite
- r-lib/actions/setup-pandoc v1 composite
- r-lib/actions/setup-r v1 composite
- r-lib/actions/setup-r-dependencies v1 composite
- actions/checkout v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite