naivebayes

High performance implementation of the Naive Bayes algorithm in R

https://github.com/majkamichal/naivebayes

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (19.5%) to scientific vocabulary

Keywords

classification-model datascience machine-learning naive-bayes r r-package
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 37
  • Watchers: 3
  • Forks: 5
  • Open Issues: 0
  • Releases: 2
Created over 8 years ago · Last pushed about 1 year ago

README.Rmd

---
output: github_document
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/"
)
```


Extended documentation can be found on the website: https://majkamichal.github.io/naivebayes/  


# Naïve Bayes 



[![BuyMeACoffee](https://img.buymeacoffee.com/button-api/?text=Buy me a coffee&emoji=&slug=michalmajka&button_colour=5F7FFF&font_colour=ffffff&font_family=Cookie&outline_colour=000000&coffee_colour=FFDD00)](https://www.buymeacoffee.com/michalmajka)

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/naivebayes)](https://cran.r-project.org/package=naivebayes)
[![](http://cranlogs.r-pkg.org/badges/naivebayes)](http://cran.rstudio.com/web/packages/naivebayes/index.html)
[![](http://cranlogs.r-pkg.org/badges/grand-total/naivebayes?color=blue)](https://cran.r-project.org/package=naivebayes)
[![Anaconda Cloud](https://anaconda.org/conda-forge/r-naivebayes/badges/version.svg?=style=flat-square&color=green)](https://anaconda.org/conda-forge/r-naivebayes/)
[![Anaconda Cloud](https://anaconda.org/conda-forge/r-naivebayes/badges/downloads.svg?color=blue)](https://anaconda.org/conda-forge/r-naivebayes/)





## 1. Overview

The `naivebayes` package provides an efficient implementation of the widely used Naïve Bayes classifier. It follows three core principles: efficiency, user-friendliness, and reliance solely on Base `R`. The last principle keeps the package stable and reliable by avoiding external dependencies^[Specialized Naïve Bayes functions within the package may optionally utilize sparse matrices if the Matrix package is installed. However, the Matrix package is not a dependency, and users are not required to install or use it.]. It also supports efficiency, because the package builds on the optimized routines of Base `R`, many of which are written in high-performance languages such as `C/C++` or `Fortran`. Together, these principles make `naivebayes` a reliable and efficient tool for Naïve Bayes classification tasks.

The `naive_bayes()` function detects the class (data type) of each feature in a dataset and, depending on user specifications, assumes a suitable distribution for each feature. It currently supports the following class conditional distributions:

  - categorical distribution for discrete features (with Bernoulli distribution as a special case for binary outcomes)
  - Poisson distribution for non-negative integer features
  - Gaussian distribution for continuous features 
  - non-parametrically estimated densities via Kernel Density Estimation for continuous features
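
The list above can be read as a mapping from a feature's `R` class to a distribution family. The following base `R` sketch illustrates that reading. The helper `guess_distribution()` is hypothetical and not part of the package, and the exact rules used by `naive_bayes()` may differ:

```r
# Hypothetical helper (NOT part of naivebayes): map a feature's R class
# to the distribution family described in the list above.
guess_distribution <- function(x, usekernel = FALSE, usepoisson = FALSE) {
  if (is.factor(x) || is.character(x) || is.logical(x)) {
    n_levels <- length(unique(x[!is.na(x)]))
    if (n_levels == 2) "Bernoulli" else "Categorical"
  } else if (is.integer(x) && usepoisson) {
    "Poisson"
  } else if (is.numeric(x)) {
    if (usekernel) "KDE" else "Gaussian"
  } else {
    stop("Unsupported feature class: ", class(x)[1])
  }
}

# Small illustrative data set (values are made up)
features <- data.frame(
  bern  = c("A", "B", "A", "B"),
  cat   = c("a", "b", "c", "a"),
  norm  = c(0.1, -1.2, 0.7, 2.3),
  count = c(3L, 7L, 5L, 12L),
  stringsAsFactors = FALSE
)

# Default settings versus an explicit request for Poisson
default_fit <- vapply(features, guess_distribution, character(1))
poisson_fit <- vapply(features, guess_distribution, character(1),
                      usepoisson = TRUE)
print(rbind(default_fit, poisson_fit))
```

In this sketch the integer column `count` is treated as Gaussian unless Poisson is requested explicitly.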

In addition, specialized functions are available that implement:

 - Bernoulli Naive Bayes via `bernoulli_naive_bayes()`
 - Multinomial Naive Bayes via `multinomial_naive_bayes()`
 - Poisson Naive Bayes via `poisson_naive_bayes()`
 - Gaussian Naive Bayes via `gaussian_naive_bayes()`
 - Non-Parametric Naive Bayes via `nonparametric_naive_bayes()`

These specialized functions are carefully optimized for efficiency and use linear algebra operations to excel when handling dense matrices. They can also exploit the *sparsity* of matrices for enhanced performance and work in the presence of missing data. The package also includes various helper functions to improve the user experience. Moreover, users can access the general `naive_bayes()` function through the excellent `caret` package, providing additional versatility.
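
To give a feel for what "linear algebra operations" can mean here, the following base `R` sketch computes Bernoulli Naive Bayes posteriors for all observations with two matrix products. It is a minimal illustration on made-up data with an assumed smoothing value, not the package's actual implementation:

```r
# Toy binary feature matrix (rows = observations) and class labels
X <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 0, 1,
              0, 1, 1), nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("f1", "f2", "f3")))
y <- factor(c("a", "a", "b", "b"))

laplace <- 1                        # additive smoothing, chosen for illustration
counts  <- rowsum(X, y)             # per-class count of ones (classes x features)
n_class <- as.vector(table(y))      # number of observations per class
p       <- (counts + laplace) / (n_class + 2 * laplace)  # P(feature = 1 | class)
prior   <- n_class / sum(n_class)

# Log-joint for every observation and class in two matrix products
log_joint <- X %*% t(log(p)) + (1 - X) %*% t(log(1 - p))
log_joint <- sweep(log_joint, 2, log(prior), "+")

# Normalise row-wise (subtracting the row maximum for numerical stability)
posterior <- exp(log_joint - apply(log_joint, 1, max))
posterior <- posterior / rowSums(posterior)
print(posterior)
```

Working on the log scale avoids underflow when many features are multiplied together.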


## 2. Installation

The `naivebayes` package can be installed from the `CRAN` repository by executing the following line in the console:

```{r, eval = FALSE}
install.packages("naivebayes")

# Or the development version from GitHub:
devtools::install_github("majkamichal/naivebayes")
```

## 3. Usage

The `naivebayes` package provides a user-friendly implementation of the Naïve Bayes algorithm via a formula interface and via the classical combination of a matrix/data.frame containing the features and a vector with the class labels. All functions recognize missing values, give an informative warning and, more importantly, know how to handle them. The following sections demonstrate the basic usage of the main function `naive_bayes()`. Examples with the specialized Naive Bayes classifiers can be found in the extended documentation: https://majkamichal.github.io/naivebayes/ in [this article](https://majkamichal.github.io/naivebayes/articles/specialized_naive_bayes.html).


### 3.1 Example data

```{r data}
library(naivebayes)

# Simulate example data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat  = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE,FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5,15)))
train <- data[1:95, ]
test <- data[96:100, -1]
```

### 3.2 Formula interface

```{r formula_interface}
nb <- naive_bayes(class ~ ., train)
summary(nb)

# Classification
predict(nb, test, type = "class")
nb %class% test

# Posterior probabilities
predict(nb, test, type = "prob")
nb %prob% test

# Helper functions
tables(nb, 1)
get_cond_dist(nb)

# Note: all "numeric" (integer, double) variables are modelled
#       with Gaussian distribution by default.
```
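
As the note in the chunk above says, numeric variables are modelled with a Gaussian distribution by default. The following base `R` sketch shows how a posterior probability could be computed by hand for a single Gaussian feature. The data are simulated and the helper `posterior_gaussian()` is hypothetical; it illustrates the principle rather than the package internals:

```r
set.seed(1)
# Toy data: one numeric feature, two classes (values are simulated)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 3))
y <- rep(c("classA", "classB"), each = 20)

# Per-class parameter estimates and empirical priors
classes <- sort(unique(y))
mu    <- vapply(classes, function(k) mean(x[y == k]), numeric(1))
sigma <- vapply(classes, function(k) sd(x[y == k]), numeric(1))
prior <- vapply(classes, function(k) mean(y == k), numeric(1))

# Hypothetical helper: posterior = prior * Gaussian density, normalised
posterior_gaussian <- function(x_new) {
  joint <- prior * dnorm(x_new, mean = mu, sd = sigma)
  joint / sum(joint)
}

print(posterior_gaussian(0))  # a point near the centre of classA
print(posterior_gaussian(3))  # a point near the centre of classB
```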

### 3.3 Matrix/data.frame and class vector

```{r generalusage}
X <- train[-1]
class <- train$class
nb2 <- naive_bayes(x = X, y = class)
nb2 %prob% test
```



### 3.4 Non-parametric estimation for continuous features

Kernel density estimation can be used to estimate the class conditional densities of continuous features. It has to be explicitly requested via the parameter `usekernel=TRUE`; otherwise, a Gaussian distribution is assumed. The estimation is performed with the built-in `R` function `density()`. By default, a Gaussian smoothing kernel and Silverman's rule of thumb as bandwidth selector are used:

```{r kde}
nb_kde <- naive_bayes(class ~ ., train, usekernel = TRUE)
summary(nb_kde)
get_cond_dist(nb_kde)
nb_kde %prob% test

# Class conditional densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

# Marginal densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "marginal")
```
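
The idea behind the kernel-based variant can be sketched in base `R`: estimate one density per class with `density()`, evaluate it at a new point, and combine it with the class priors. The data below are simulated, and the interpolation via `approxfun()` is an assumption made for this illustration, not a statement about the package internals:

```r
set.seed(1)
# Simulated feature values for two classes
x <- c(rnorm(50, mean = -1), rnorm(50, mean = 2))
y <- rep(c("classA", "classB"), each = 50)

# One kernel density estimate per class with the defaults of density()
# (Gaussian kernel, bw = "nrd0")
dens <- lapply(split(x, y), density)

# Turn each estimate into a function by linear interpolation;
# points outside the estimated grid get density 0
cond_density <- lapply(dens, function(d) {
  approxfun(d$x, d$y, yleft = 0, yright = 0)
})

# Posterior for a new observation
x_new <- 0.5
lik   <- vapply(cond_density, function(f) f(x_new), numeric(1))
prior <- as.vector(table(y)) / length(y)
posterior_kde <- prior * lik / sum(prior * lik)
print(posterior_kde)
```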


#### 3.4.1 Changing kernel

In general, there are 7 different smoothing kernels available:

- `gaussian` 
- `epanechnikov`
- `rectangular`
- `triangular`
- `biweight`
- `cosine`
- `optcosine`

and they can be specified in `naive_bayes()` via the additional parameter `kernel`. The Gaussian kernel is the default smoothing kernel. Please see `density()` and `bw.nrd()` for further details.

```{r kde_kernel}
# Change Gaussian kernel to biweight kernel
nb_kde_biweight <- naive_bayes(class ~ ., train, usekernel = TRUE,
                               kernel = "biweight")
nb_kde_biweight %prob% test
plot(nb_kde_biweight, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
```
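
To see the effect of the kernel in isolation, the seven kernels can be compared directly with `density()` on simulated data. This is a small base `R` sketch, independent of the package:

```r
set.seed(1)
x <- rnorm(100)  # simulated sample

kernels <- c("gaussian", "epanechnikov", "rectangular", "triangular",
             "biweight", "cosine", "optcosine")
estimates <- lapply(kernels, function(k) density(x, kernel = k))
names(estimates) <- kernels

# density() scales every kernel so that `bw` is its standard deviation,
# hence the default bandwidth is identical across kernels
bandwidths <- vapply(estimates, function(d) d$bw, numeric(1))

# Peak height of each estimate, a rough indicator of how the shapes differ
peaks <- vapply(estimates, function(d) max(d$y), numeric(1))
print(round(cbind(bandwidths, peaks), 4))
```

Because the bandwidth is held fixed, any difference between the estimates comes from the kernel shape alone.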


#### 3.4.2 Changing bandwidth selector

The `density()` function offers 5 different bandwidth selectors, which can be specified via the `bw` parameter:

- `nrd0` (Silverman's rule-of-thumb)
- `nrd` (variation of the rule-of-thumb)
- `ucv` (unbiased cross-validation)
- `bcv` (biased cross-validation)
- `SJ` (Sheather & Jones method)


```{r kde_bw}
nb_kde_SJ <- naive_bayes(class ~ ., train, usekernel = TRUE,
                         bw = "SJ")
nb_kde_SJ %prob% test
plot(nb_kde_SJ, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
```
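
Each selector name corresponds to a base `R` function (`bw.nrd0()`, `bw.nrd()`, `bw.ucv()`, `bw.bcv()` and `bw.SJ()`), so the resulting bandwidths can be inspected directly. The following sketch uses simulated data:

```r
set.seed(1)
x <- rnorm(200)  # simulated sample

# Bandwidth chosen by each selector; the cross-validation selectors may
# warn when the minimum lies at the edge of the search range
bandwidths <- c(
  nrd0 = bw.nrd0(x),
  nrd  = bw.nrd(x),
  ucv  = suppressWarnings(bw.ucv(x)),
  bcv  = suppressWarnings(bw.bcv(x)),
  SJ   = bw.SJ(x)
)
print(round(bandwidths, 4))

# Passing the selector name to density() yields the same bandwidth
d_sj <- density(x, bw = "SJ")
print(d_sj$bw)
```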

#### 3.4.3 Adjusting bandwidth

The parameter `adjust` rescales the estimated bandwidth and thus adds flexibility to the estimation process. The default value of 1 means no rescaling. For values below 1 the density becomes "wigglier", and for values above 1 it tends to be "smoother":


```{r kde_adjust}
nb_kde_adjust <- naive_bayes(class ~ ., train, usekernel = TRUE,
                             adjust = 0.5)
nb_kde_adjust %prob% test
plot(nb_kde_adjust, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
```
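
In `density()`, the bandwidth actually used is `adjust` times the selected bandwidth. A short base `R` check on simulated data makes this concrete:

```r
set.seed(1)
x <- rnorm(100)  # simulated sample

d_default <- density(x)                # adjust = 1, no rescaling
d_wiggly  <- density(x, adjust = 0.5)  # half the bandwidth
d_smooth  <- density(x, adjust = 2)    # twice the bandwidth

print(c(default = d_default$bw, wiggly = d_wiggly$bw, smooth = d_smooth$bw))
```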


### 3.5 Model non-negative integers with Poisson distribution

Class conditional distributions of non-negative integer predictors can be modelled with a Poisson distribution. This is achieved by setting `usepoisson=TRUE` in the `naive_bayes()` function and by making sure that the variables representing counts in the dataset are of class `integer`.

```{r poisson}
is.integer(train$count)
nb_pois <- naive_bayes(class ~ ., train, usepoisson = TRUE)
summary(nb_pois)
get_cond_dist(nb_pois)

nb_pois %prob% test

# Class conditional distributions
plot(nb_pois, "count", prob = "conditional")

# Marginal distributions
plot(nb_pois, "count", prob = "marginal")
```
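
The Poisson variant can be sketched by hand in base `R`: the rate of each class is estimated by the class mean, and `dpois()` supplies the class conditional probabilities. The counts below are made up and the helper `posterior_poisson()` is hypothetical; details such as smoothing in the package may differ:

```r
# Made-up integer counts for two classes
count <- c(4L, 6L, 5L, 3L, 14L, 16L, 15L, 17L)
y     <- rep(c("classA", "classB"), each = 4)

# Rate estimate per class (the class mean) and empirical priors
classes <- sort(unique(y))
lambda  <- vapply(classes, function(k) mean(count[y == k]), numeric(1))
prior   <- vapply(classes, function(k) mean(y == k), numeric(1))

# Hypothetical helper: posterior = prior * Poisson pmf, normalised
posterior_poisson <- function(count_new) {
  joint <- prior * dpois(count_new, lambda = lambda)
  joint / sum(joint)
}

print(lambda)
print(posterior_poisson(5L))   # a small count
print(posterior_poisson(15L))  # a large count
```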

Owner

  • Name: Michal Majka
  • Login: majkamichal
  • Kind: user
  • Location: Vienna

Stats @univienna @DukeU and @ErsteGroup www.stackoverflow.com/users/4956364/michal-majka

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 236
  • Total Committers: 1
  • Avg Commits per committer: 236.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 7
  • Committers: 1
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Michal Majka m****a@h****m 236

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 6
  • Total pull requests: 3
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 28 days
  • Total issue authors: 6
  • Total pull request authors: 3
  • Average comments per issue: 1.33
  • Average comments per pull request: 0.67
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kohlkopf (1)
  • willtownes (1)
  • markvanderloo (1)
  • hyaochn (1)
  • AndreM84 (1)
  • BharatAD (1)
Pull Request Authors
  • majkamichal (1)
  • willtownes (1)
  • colin-fraser (1)
Top Labels
Issue Labels
bug (3) help wanted (2) wontfix (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • cran 3,618 last-month
  • Total docker downloads: 21,870
  • Total dependent packages: 10
    (may contain duplicates)
  • Total dependent repositories: 15
    (may contain duplicates)
  • Total versions: 11
  • Total maintainers: 1
cran.r-project.org: naivebayes

High Performance Implementation of the Naive Bayes Algorithm

  • Versions: 9
  • Dependent Packages: 10
  • Dependent Repositories: 13
  • Downloads: 3,618 Last month
  • Docker Downloads: 21,870
Rankings
Dependent packages count: 5.3%
Downloads: 5.5%
Dependent repos count: 8.0%
Stargazers count: 9.0%
Average: 10.5%
Forks count: 10.8%
Docker downloads count: 24.2%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-naivebayes
  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 2
Rankings
Dependent repos count: 20.2%
Average: 42.5%
Stargazers count: 46.1%
Dependent packages count: 51.6%
Forks count: 52.2%
Last synced: 7 months ago

Dependencies

DESCRIPTION cran
  • Matrix * suggests
  • knitr * suggests