https://github.com/csadorf/cuml

cuML - RAPIDS Machine Learning Library

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: arxiv.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (14.9%) to scientific vocabulary

Last synced: 9 months ago

Repository

cuML - RAPIDS Machine Learning Library

Basic Info

Host: GitHub
Owner: csadorf
License: apache-2.0
Language: C++
Default Branch: branch-22.12
Homepage:
Size: 159 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Fork of rapidsai/cuml

Created over 3 years ago · Last pushed 11 months ago

Metadata Files

Readme Changelog Contributing License Codeowners

cuML - GPU Machine Learning Algorithms

cuML is a suite of libraries that implement machine learning algorithms and mathematical primitive functions that share compatible APIs with other RAPIDS projects.

cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.
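Because the two APIs largely match, one common pattern is to pick a backend at import time and keep the rest of the code unchanged. The sketch below is a minimal illustration using only the standard library; the helper name `pick_backend` and the candidate order are assumptions for illustration, not part of cuML.

```python
import importlib.util


def pick_backend(candidates=("cuml", "sklearn")):
    """Return the first importable top-level package name, or None.

    `pick_backend` is a hypothetical helper, not a cuML function.
    Only top-level package names are supported here.
    """
    for name in candidates:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = pick_backend()
print(f"Selected backend: {backend}")
```

With the backend name in hand, the matching estimator module can be loaded with `importlib.import_module`, for example `cuml.cluster` on a GPU machine or `sklearn.cluster` elsewhere.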

For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.
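A speedup figure like this is the ratio of CPU time to GPU time for the same task. The following sketch shows one way such a ratio could be measured; `best_of`, `speedup`, and the two stand-in workloads are hypothetical names used for illustration, and in practice the callables would wrap the `fit` calls of a CPU estimator and its cuML counterpart.

```python
import time


def best_of(fn, repeats=3):
    """Return the shortest wall-clock time, in seconds, over `repeats` calls of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline time to candidate time (values above 1 mean faster)."""
    return baseline_seconds / candidate_seconds


# Stand-in workloads; replace these with real CPU and GPU training calls.
def slow_workload():
    sum(i * i for i in range(200_000))


def fast_workload():
    sum(i * i for i in range(20_000))


ratio = speedup(best_of(slow_workload), best_of(fast_workload))
print(f"Measured speedup: {ratio:.1f}x")
```

Taking the minimum over several runs reduces noise from warm-up and background load, which matters when comparing libraries with different start-up costs.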

As an example, the following Python snippet loads input and computes DBSCAN clusters, all on GPU, using cuDF:

```python
import cudf
from cuml.cluster import DBSCAN

# Create and populate a GPU DataFrame
gdf_float = cudf.DataFrame()
gdf_float['0'] = [1.0, 2.0, 5.0]
gdf_float['1'] = [4.0, 2.0, 1.0]
gdf_float['2'] = [4.0, 2.0, 1.0]

# Setup and fit clusters
dbscan_float = DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(gdf_float)

print(dbscan_float.labels_)
```

Output:

    0    0
    1    1
    2    2
    dtype: int32
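To see why each of the three rows ends up in its own cluster, the same data can be run through a small plain-Python DBSCAN. This is a CPU reference sketch written for illustration, not cuML's implementation; the function name `dbscan_labels` is an assumption. With `min_samples=1` every point is a core point, and no two rows lie within `eps=1.0` of each other, so three clusters result.

```python
import math


def dbscan_labels(points, eps, min_samples):
    """Plain-Python DBSCAN reference; returns one label per point (-1 = noise)."""
    n = len(points)
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster from this unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Rows of the DataFrame above: columns '0', '1', '2' read across.
rows = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]
print(dbscan_labels(rows, eps=1.0, min_samples=1))  # [0, 1, 2]
```

The closest pair of rows is 3.0 apart, so raising `eps` to 4.0 would merge all three rows into a single cluster.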

cuML also features multi-GPU and multi-node-multi-GPU operation, using Dask, for a growing list of algorithms. The following Python snippet reads input from a CSV file and performs a NearestNeighbors query across a cluster of Dask workers, using multiple GPUs on a single node:

Initialize a LocalCUDACluster configured with UCX for fast transport of CUDA arrays:

```python
# Initialize UCX for high-speed transport of CUDA arrays
from dask_cuda import LocalCUDACluster

# Create a Dask single-node CUDA cluster w/ one worker per device
cluster = LocalCUDACluster(protocol="ucx",
                           enable_tcp_over_ucx=True,
                           enable_nvlink=True,
                           enable_infiniband=False)
```

Load data and perform k-Nearest Neighbors search. cuml.dask estimators also support Dask.Array as input:

```python
from dask.distributed import Client
client = Client(cluster)

# Read CSV file in parallel across workers
import dask_cudf
df = dask_cudf.read_csv("/path/to/csv")

# Fit a NearestNeighbors model and query it
from cuml.dask.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors = 10, client=client)
nn.fit(df)
neighbors = nn.kneighbors(df)
```
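For readers unfamiliar with what a k-nearest-neighbors query computes, here is a brute-force CPU sketch of the same operation in plain Python. It assumes the scikit-learn convention of returning distances and indices per query row; the standalone `kneighbors` function below is illustrative only and says nothing about how cuML distributes the work or which container types it returns.

```python
import heapq
import math


def kneighbors(index_rows, query_rows, n_neighbors):
    """Brute-force k-NN: for each query row, return the distances to and
    positions of its n_neighbors closest rows in index_rows."""
    distances, indices = [], []
    for q in query_rows:
        # Keep the n_neighbors smallest (distance, position) pairs.
        nearest = heapq.nsmallest(
            n_neighbors,
            ((math.dist(q, r), i) for i, r in enumerate(index_rows)),
        )
        distances.append([d for d, _ in nearest])
        indices.append([i for _, i in nearest])
    return distances, indices


rows = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
dists, idx = kneighbors(rows, rows, n_neighbors=2)
print(idx)  # [[0, 1], [1, 0], [2, 1]]
```

When the query set is the same as the fitted set, as in the README snippet, each row's nearest neighbor is itself at distance zero.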

For additional examples, browse our complete API documentation, or check out our example walkthrough notebooks. Finally, you can find complete end-to-end examples in the notebooks-contrib repo.

Supported Algorithms

| Category | Algorithm | Notes |
| --- | --- | --- |
| Clustering | Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | Multi-node multi-GPU via Dask |
| | Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) | |
| | K-Means | Multi-node multi-GPU via Dask |
| | Single-Linkage Agglomerative Clustering | |
| Dimensionality Reduction | Principal Components Analysis (PCA) | Multi-node multi-GPU via Dask |
| | Incremental PCA | |
| | Truncated Singular Value Decomposition (tSVD) | Multi-node multi-GPU via Dask |
| | Uniform Manifold Approximation and Projection (UMAP) | Multi-node multi-GPU Inference via Dask |
| | Random Projection | |
| | t-Distributed Stochastic Neighbor Embedding (TSNE) | |
| Linear Models for Regression or Classification | Linear Regression (OLS) | Multi-node multi-GPU via Dask |
| | Linear Regression with Lasso or Ridge Regularization | Multi-node multi-GPU via Dask |
| | ElasticNet Regression | |
| | LARS Regression | (experimental) |
| | Logistic Regression | Multi-node multi-GPU via Dask-GLM demo |
| | Naive Bayes | Multi-node multi-GPU via Dask |
| | Stochastic Gradient Descent (SGD), Coordinate Descent (CD), and Quasi-Newton (QN) (including L-BFGS and OWL-QN) solvers for linear models | |
| Nonlinear Models for Regression or Classification | Random Forest (RF) Classification | Experimental multi-node multi-GPU via Dask |
| | Random Forest (RF) Regression | Experimental multi-node multi-GPU via Dask |
| | Inference for decision tree-based models | Forest Inference Library (FIL) |
| | K-Nearest Neighbors (KNN) Classification | Multi-node multi-GPU via Dask+UCX, uses Faiss for Nearest Neighbors Query. |
| | K-Nearest Neighbors (KNN) Regression | Multi-node multi-GPU via Dask+UCX, uses Faiss for Nearest Neighbors Query. |
| | Support Vector Machine Classifier (SVC) | |
| | Epsilon-Support Vector Regression (SVR) | |
| Preprocessing | Standardization, or mean removal and variance scaling / Normalization / Encoding categorical features / Discretization / Imputation of missing values / Polynomial features generation / and coming soon custom transformers and non-linear transformation | Based on Scikit-Learn preprocessing |
| Time Series | Holt-Winters Exponential Smoothing | |
| | Auto-regressive Integrated Moving Average (ARIMA) | Supports seasonality (SARIMA) |
| Model Explanation | SHAP Kernel Explainer | Based on SHAP |
| | SHAP Permutation Explainer | Based on SHAP |
| Other | K-Nearest Neighbors (KNN) Search | Multi-node multi-GPU via Dask+UCX, uses Faiss for Nearest Neighbors Query. |

Installation

See the RAPIDS Release Selector for the command line to install either nightly or official release cuML packages via Conda or Docker.

Build/Install from Source

See the build guide.

Contributing

Please see our guide for contributing to cuML.

References

The RAPIDS team has a number of blogs with deeper technical dives and examples. You can find them here on Medium.

For additional details on the technologies behind cuML, as well as a broader overview of the Python Machine Learning landscape, see Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence (2020) by Sebastian Raschka, Joshua Patterson, and Corey Nolet.

Please consider citing this when using cuML in a project. You can use the citation BibTeX:

    @article{raschka2020machine,
      title={Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence},
      author={Raschka, Sebastian and Patterson, Joshua and Nolet, Corey},
      journal={arXiv preprint arXiv:2002.04803},
      year={2020}
    }

Contact

Find out more details on the RAPIDS site

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Owner

Name: Carl Simon Adorf
Login: csadorf
Kind: user
Location: Lausanne, CH
Company: NVIDIA

Website: https://carlsimonadorf.com
Twitter: carlsimonadorf
Repositories: 5
Profile: https://github.com/csadorf

SE @NVIDIA working on @rapidsai

GitHub Events

Total

Delete event: 43
Push event: 191
Create event: 66

Last Year

Delete event: 43
Push event: 191
Create event: 66

Dependencies

.github/workflows/labeler.yml actions

actions/labeler main composite

.github/workflows/new-issues-to-triage-projects.yml actions

docker://takanabe/github-actions-automate-projects v0.0.1 composite

Dockerfile docker

cudf latest build

python/setup.py pypi

cython *
numba *
