rsparse

Fast and accurate machine learning on sparse matrices - matrix factorizations, regression, classification, top-N recommendations.

https://github.com/dselivanov/rsparse

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, ieee.org, acm.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.6%) to scientific vocabulary

Keywords

collaborative-filtering factorization-machines matrix-completion matrix-factorization r recommender-system sparse-matrices svd
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 178
  • Watchers: 16
  • Forks: 30
  • Open Issues: 4
  • Releases: 0
Topics
collaborative-filtering factorization-machines matrix-completion matrix-factorization r recommender-system sparse-matrices svd
Created almost 9 years ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog

README.md

rsparse

rsparse is an R package for statistical learning primarily on sparse matrices: matrix factorizations, factorization machines, and out-of-core regression. Many of the implemented algorithms are particularly useful for recommender systems and NLP.

We've paid some attention to the implementation details: we try to avoid data copies, utilize multiple threads via OpenMP, and use SIMD where appropriate. The package can handle datasets with millions of rows and millions of columns.

Features

Classification/Regression

  1. Follow the proximally-regularized leader (FTRL-Proximal), which makes it possible to solve very large linear/logistic regression problems with an elastic-net penalty. The solver uses stochastic gradient descent with adaptive learning rates, so it can be used for online learning (it is not necessary to load all the data into RAM). See Ad Click Prediction: a View from the Trenches for more examples.
    • Only logistic regression is implemented at the moment.
    • The native format for matrices is CSR - Matrix::RsparseMatrix. However, the common R Matrix::CsparseMatrix (dgCMatrix) is converted automatically.
  2. Factorization Machines: a supervised learning algorithm that learns second-order polynomial interactions in a factorized way. We provide a highly optimized, SIMD-accelerated implementation.
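
As a rough illustration of the FTRL workflow in item 1, the sketch below trains a logistic model on a synthetic CSR matrix, feeding the rows in chunks to mimic online learning. The data, the chunking scheme and the hyperparameter values are assumptions made up for this example, not recommendations; the constructor arguments and method names (`partial_fit`, `predict`, `coef`) are taken from the package's reference manual and should be checked against the installed version.

```r
library(Matrix)
library(rsparse)

set.seed(1)
n_rows = 1000L
n_cols = 50L

# Synthetic sparse design matrix (illustrative data only).
x_csc = rsparsematrix(n_rows, n_cols, density = 0.1)

# Binary labels that depend on the first three columns plus noise.
w_true = c(2, -2, 1.5, rep(0, n_cols - 3L))
score = as.vector(x_csc %*% w_true) + rnorm(n_rows, sd = 0.1)
y = as.numeric(score > 0)

# CSR is the native format; a dgCMatrix would be converted automatically.
x = as(x_csc, "RsparseMatrix")

# Elastic-net penalty: lambda is the overall strength, l1_ratio = 1 means pure L1.
model = FTRL$new(learning_rate = 0.1, lambda = 0.01, l1_ratio = 1)

# Feed the data in four chunks, as if they arrived as a stream.
chunks = split(seq_len(n_rows), rep(1:4, each = n_rows / 4))
for (idx in chunks) {
  model$partial_fit(x[idx, , drop = FALSE], y[idx])
}

p = model$predict(x)  # predicted probabilities for the positive class
w = model$coef()      # fitted coefficients
```

Because `partial_fit` only updates the model with the rows it is given, the same loop could read each chunk from disk instead of slicing an in-memory matrix.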

Matrix Factorizations

  1. Vanilla Maximum Margin Matrix Factorization (MMMF) - the classic approach to "rating" prediction. See the WRMF class and the constructor option feedback = "explicit". The original paper which introduced MMMF can be found here.
  2. Weighted Regularized Matrix Factorization (WRMF) from Collaborative Filtering for Implicit Feedback Datasets. See the WRMF class and the constructor option feedback = "implicit". We provide two solvers:
    1. Exact, based on Cholesky factorization.
    2. Approximate, based on a fixed number of Conjugate Gradient steps. See details in Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering and Faster Implicit Matrix Factorization.
  3. Linear-Flow from Practical Linear Models for Large-Scale One-Class Collaborative Filtering. The algorithm looks for a factorized low-rank item-item similarity matrix (in some sense it is similar to SLIM).
  4. Fast Truncated SVD and Truncated Soft-SVD via Alternating Least Squares as described in Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares. Works for both sparse and dense matrices. Works on float matrices as well! For certain problems it may be even faster than the irlba package.
  5. Soft-Impute via fast Alternating Least Squares as described in Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.
    • with a solution in SVD form
  6. GloVe as described in GloVe: Global Vectors for Word Representation.
    • This is usually used to train word embeddings, but it is also very useful for recommender systems.
  7. Matrix scaling as described in EigenRec: Generalizing PureSVD for Effective and Efficient Top-N Recommendations.
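
To make the two WRMF solvers from item 2 concrete, here is a minimal sketch that fits the same implicit-feedback model once with the conjugate gradient solver and once with the Cholesky solver. It uses the `movielens100k` matrix bundled with the package; the rank, regularization and iteration counts are arbitrary values chosen for illustration, and the argument names follow the reference manual, so verify them against your installed version.

```r
library(rsparse)

data("movielens100k")  # user-item matrix shipped with rsparse
train = movielens100k[1:900, ]

# Approximate solver: a fixed number of conjugate gradient steps per update.
model_cg = WRMF$new(rank = 10, lambda = 0.1, feedback = "implicit",
                    solver = "conjugate_gradient", cg_steps = 3L)
user_emb_cg = model_cg$fit_transform(train, n_iter = 5, convergence_tol = -1)

# Exact solver: each least-squares subproblem is solved via Cholesky factorization.
model_chol = WRMF$new(rank = 10, lambda = 0.1, feedback = "implicit",
                      solver = "cholesky")
user_emb_chol = model_chol$fit_transform(train, n_iter = 5, convergence_tol = -1)

# Both runs return one embedding row per user and one column per latent factor.
dim(user_emb_cg)
dim(user_emb_chol)
```

Setting `feedback = "explicit"` in the same constructor switches to the rating-prediction model from item 1.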

Note: the optimized matrix operations which rsparse used to offer have been moved to a separate package (MatrixExtra).

Installation

Most of the algorithms benefit from OpenMP, and many of them can use high-performance implementations of BLAS. If you want to get the most out of this package, please read the section below carefully.

It is recommended to:

  1. Use high-performance BLAS (such as OpenBLAS, MKL, Apple Accelerate).
  2. Add suitable compiler optimizations to your ~/.R/Makevars. For example, on recent processors (with AVX support) and a compiler with OpenMP support, the following lines could be a good option:

CXX11FLAGS += -O3 -march=native -fopenmp
CXXFLAGS += -O3 -march=native -fopenmp

Mac OS

If you are on a Mac, follow the instructions at https://mac.r-project.org/openmp/. After configuring clang, additionally put a PKG_CXXFLAGS += -DARMA_USE_OPENMP line in your ~/.R/Makevars. After that, install rsparse in the usual way.

We also recommend using vecLib, Apple's implementation of BLAS:

ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib /Library/Frameworks/R.framework/Resources/lib/libRblas.dylib

Linux

On Linux, it is enough to create the ~/.R/Makevars file if it doesn't already exist.

If using OpenBLAS, it is highly recommended to use the openmp variant rather than the pthreads variant. On Linux, it is usually available as a separate package in the distribution's package manager (e.g. on Debian it can be obtained by installing libopenblas-openmp-dev, which is not the default version). If multiple BLAS libraries are installed, it can be set as the default through the Debian alternatives system, which can also be used for MKL.

Windows

By default, R for Windows comes with unoptimized BLAS and LAPACK libraries, and rsparse will prefer using Armadillo's replacements instead. In order to use BLAS, install rsparse from source (not from CRAN), removing the option -DARMA_DONT_USE_BLAS from src/Makevars.win and ideally adding -march=native (under PKG_CXXFLAGS). See this tutorial for instructions on getting R for Windows to use OpenBLAS. Alternatively, Microsoft's MRAN distribution for Windows comes with MKL.
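
Whichever platform you are on, it can help to confirm which BLAS R actually picked up after making these changes. The snippet below is a small sketch using base R plus RhpcBLASctl (already listed among the package's imports); the reported paths and thread counts depend entirely on your machine, so no expected output is shown.

```r
# Path of the BLAS library R is linked against (reported by R >= 3.5).
blas_path = extSoftVersion()["BLAS"]
print(blas_path)

# Path of the LAPACK library in use.
lapack_path = La_library()
print(lapack_path)

# Thread counts visible to BLAS and OpenMP.
library(RhpcBLASctl)
blas_threads = blas_get_num_procs()
omp_threads = omp_get_max_threads()
print(c(blas = blas_threads, openmp = omp_threads))
```

If the BLAS path still points at R's reference library (typically a file named libRblas), the optimized BLAS has not been picked up.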

Materials

Note that the syntax in these posts/slides is not up to date, since the package was under active development.

  1. Slides from DataFest Tbilisi (2017-11-16)

Here is an example of rsparse::WRMF on the lastfm360k dataset in comparison with other good implementations (figure not shown).

API

We follow mlapi conventions.
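
In practice this means that models are objects created with `$new()` and used through methods such as `fit_transform()`, `transform()` and `predict()`, with the learned item factors available in `$components`. The sketch below shows that pattern with WRMF on the bundled `movielens100k` data; the train/test split and hyperparameters are assumptions chosen only for illustration, and the method names are as documented in the reference manual.

```r
library(rsparse)

data("movielens100k")
train = movielens100k[1:900, ]
test = movielens100k[901:nrow(movielens100k), ]

model = WRMF$new(rank = 10, lambda = 0.1, feedback = "implicit")

# fit_transform(): learn the model and return embeddings for the training users.
train_user_emb = model$fit_transform(train, n_iter = 5, convergence_tol = -1)

# components: the learned item embeddings.
item_emb = model$components

# transform(): embed users that were not seen during training.
new_user_emb = model$transform(test)

# predict(): top-10 items per new user, skipping items they already interacted with.
top10 = model$predict(test, k = 10, not_recommend = test)
```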

Release and configure

Making release

Don't forget to add -DARMA_NO_DEBUG to PKG_CXXFLAGS to skip bounds checks (this has a significant impact on the NNLS solver):

PKG_CXXFLAGS = ... -DARMA_NO_DEBUG

Configure

Generate configure:

autoconf configure.ac > configure && chmod +x configure

Owner

  • Name: Dmitry Selivanov
  • Login: dselivanov
  • Kind: user
  • Location: Dubai
  • Company: rexy.ai

GitHub Events

Total
  • Issues event: 3
  • Watch event: 9
  • Issue comment event: 5
  • Push event: 2
  • Pull request event: 1
  • Fork event: 1
  • Create event: 1
Last Year
  • Issues event: 3
  • Watch event: 9
  • Issue comment event: 5
  • Push event: 2
  • Pull request event: 1
  • Fork event: 1
  • Create event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 308
  • Total Committers: 6
  • Avg Commits per committer: 51.333
  • Development Distribution Score (DDS): 0.146
Past Year
  • Commits: 5
  • Committers: 3
  • Avg Commits per committer: 1.667
  • Development Distribution Score (DDS): 0.6
Top Committers
Name Email Commits
Dmitriy Selivanov s****y@g****m 263
david-cortes d****a@g****m 37
wccsnow w****w@g****m 4
Ivan K k****t@g****m 2
Anton Petrov g****s 1
AliciaSchep a****p@g****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 38
  • Total pull requests: 41
  • Average time to close issues: 7 months
  • Average time to close pull requests: 9 days
  • Total issue authors: 18
  • Total pull request authors: 6
  • Average comments per issue: 4.39
  • Average comments per pull request: 2.98
  • Merged pull requests: 36
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 2
  • Average time to close issues: 18 days
  • Average time to close pull requests: 3 days
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 1.5
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dselivanov (14)
  • david-cortes (6)
  • saraswatmks (2)
  • zdebruine (2)
  • wrathematics (1)
  • Gbau08 (1)
  • dfalbel (1)
  • vkummer (1)
  • ThakurRajAnand (1)
  • RobertKomara (1)
  • jamespr615 (1)
  • hongooi73 (1)
  • mscbuck (1)
  • BradKML (1)
  • aitap (1)
Pull Request Authors
  • david-cortes (30)
  • dselivanov (5)
  • snoweye (3)
  • aitap (2)
  • AliciaSchep (1)
  • gsenseless (1)
Top Labels
Issue Labels
  • docs (4)
  • feature request (2)
  • enhancement (2)
  • wontfix (1)
  • installation (1)
  • bug (1)
  • question (1)
Pull Request Labels

Packages

  • Total packages: 3
  • Total downloads:
    • cran 46,463 last-month
  • Total docker downloads: 44,521
  • Total dependent packages: 5
    (may contain duplicates)
  • Total dependent repositories: 5
    (may contain duplicates)
  • Total versions: 20
  • Total maintainers: 1
proxy.golang.org: github.com/dselivanov/rsparse
  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.5%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 6 months ago
cran.r-project.org: rsparse

Statistical Learning on Sparse Matrices

  • Versions: 10
  • Dependent Packages: 4
  • Dependent Repositories: 5
  • Downloads: 46,463 Last month
  • Docker Downloads: 44,521
Rankings
Forks count: 2.5%
Stargazers count: 2.5%
Downloads: 6.2%
Average: 9.1%
Dependent packages count: 9.3%
Dependent repos count: 13.0%
Docker downloads count: 21.0%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-rsparse
  • Versions: 4
  • Dependent Packages: 1
  • Dependent Repositories: 0
Rankings
Dependent packages count: 28.8%
Average: 31.4%
Dependent repos count: 34.0%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • Matrix >= 1.3 depends
  • R >= 3.6.0 depends
  • methods * depends
  • MatrixExtra >= 0.1.7 imports
  • Rcpp >= 0.11 imports
  • RhpcBLASctl * imports
  • data.table >= 1.10.0 imports
  • float >= 0.2 imports
  • lgr >= 0.2 imports
  • covr * suggests
  • testthat * suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/setup-r v1 composite
.github/workflows/test-coverage.yaml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc master composite
  • r-lib/actions/setup-r master composite