naivebayes

High performance implementation of the Naive Bayes algorithm in R

https://github.com/majkamichal/naivebayes

Keywords

classification-model datascience machine-learning naive-bayes r r-package

Last synced: 9 months ago · JSON representation

Repository

High performance implementation of the Naive Bayes algorithm in R

Basic Info

Host: GitHub
Owner: majkamichal
License: gpl-2.0
Language: R
Default Branch: master
Homepage: https://majkamichal.github.io/naivebayes/
Size: 24.1 MB

Statistics

Stars: 37
Watchers: 3
Forks: 5
Open Issues: 0
Releases: 2

Topics

classification-model datascience machine-learning naive-bayes r r-package

Created almost 9 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/"
)
```


Extended documentation can be found on the website: https://majkamichal.github.io/naivebayes/  


# Naïve Bayes 



[![BuyMeACoffee](https://img.buymeacoffee.com/button-api/?text=Buy me a coffee&emoji=&slug=michalmajka&button_colour=5F7FFF&font_colour=ffffff&font_family=Cookie&outline_colour=000000&coffee_colour=FFDD00)](https://www.buymeacoffee.com/michalmajka)

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/naivebayes)](https://cran.r-project.org/package=naivebayes)
[![](http://cranlogs.r-pkg.org/badges/naivebayes)](http://cran.rstudio.com/web/packages/naivebayes/index.html)
[![](http://cranlogs.r-pkg.org/badges/grand-total/naivebayes?color=blue)](https://cran.r-project.org/package=naivebayes)[![Anaconda Cloud](https://anaconda.org/conda-forge/r-naivebayes/badges/version.svg?=style=flat-square&color=green)](https://anaconda.org/conda-forge/r-naivebayes/)
[![Anaconda Cloud](https://anaconda.org/conda-forge/r-naivebayes/badges/downloads.svg?color=blue)](https://anaconda.org/conda-forge/r-naivebayes/)





## 1. Overview

The `naivebayes` package presents an efficient implementation of the widely-used Naïve Bayes classifier. It upholds three core principles: efficiency, user-friendliness, and reliance solely on Base `R`. By adhering to the latter principle, the package ensures stability and reliability without introducing external dependencies^[Specialized Naïve Bayes functions within the package may optionally utilize sparse matrices if the Matrix package is installed. However, the Matrix package is not a dependency, and users are not required to install or use it.]. This design choice maintains efficiency by leveraging the optimized routines inherent in Base `R`, many of which are programmed in high-performance languages like `C/C++` or `FORTRAN`. By following these principles, the `naivebayes` package provides a reliable and efficient tool for Naïve Bayes classification tasks, ensuring that users can perform their analyses effectively and with ease. 

The `naive_bayes()` function is designed to determine the class of each feature in a dataset, and depending on user specifications, it can assume various distributions for each feature. It currently supports the following class conditional distributions:

  - categorical distribution for discrete features (with Bernoulli distribution as a special case for binary outcomes)
  - Poisson distribution for non-negative integer features
  - Gaussian distribution for continuous features 
  - non-parametrically estimated densities via Kernel Density Estimation for continuous features

In addition to that specialized functions are available which implement:

 - Bernoulli Naive Bayes via `bernoulli_naive_bayes()`
 - Multinomial Naive Bayes via `multinomial_naive_bayes()`
 - Poisson Naive Bayes via `poisson_naive_bayes()`
 - Gaussian Naive Bayes via `gaussian_naive_bayes()`
 - Non-Parametric Naive Bayes via `nonparametric_naive_bayes()`

These specialized functions are carefully optimized for efficiency, utilizing linear algebra operations to excel when handling dense matrices. Additionally, they can also exploit *sparsity* of matrices for enhanced performance and work in presence of missing data. The package also includes various helper functions to improve user experience. Moreover, users can access the general `naive_bayes()` function through the excellent `Caret` package, providing additional versatility.


## 2. Installation

The `naivebayes` package can be installed from the `CRAN` repository by simply executing in the console the following line:

```{r, eval = FALSE}
install.packages("naivebayes")

# Or the the development version from GitHub:
devtools::install_github("majkamichal/naivebayes")
```

## 3. Usage

The `naivebayes` package provides a user friendly implementation of the Naïve Bayes algorithm via formula interlace and classical combination of the matrix/data.frame containing the features and a vector with the class labels. All functions can recognize missing values, give an informative warning and more importantly - they know how to handle them. In following the basic usage of the main function `naive_bayes()` is demonstrated. Examples with the specialized Naive Bayes classifiers can be found in the extended documentation: https://majkamichal.github.io/naivebayes/ in [this article](https://majkamichal.github.io/naivebayes/articles/specialized_naive_bayes.html).


### 3.1 Example data

```{r data}
library(naivebayes)

# Simulate example data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat  = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE,FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5,15)))
train <- data[1:95, ]
test <- data[96:100, -1]
```

### 3.2 Formula interface

```{r formula_interface}
nb <- naive_bayes(class ~ ., train)
summary(nb)

# Classification
predict(nb, test, type = "class")
nb %class% test

# Posterior probabilities
predict(nb, test, type = "prob")
nb %prob% test

# Helper functions
tables(nb, 1)
get_cond_dist(nb)

# Note: all "numeric" (integer, double) variables are modelled
#       with Gaussian distribution by default.
```

### 3.3 Matrix/data.frame and class vector

```{r generalusage}
X <- train[-1]
class <- train$class
nb2 <- naive_bayes(x = X, y = class)
nb2 %prob% test
```



### 3.4 Non-parametric estimation for continuous features

Kernel density estimation can be used to estimate class conditional densities of continuous features. It has to be explicitly requested via the parameter `usekernel=TRUE` otherwise Gaussian distribution will be assumed. The estimation is performed with the built in `R` function `density()`. By default, Gaussian smoothing kernel and Silverman's rule of thumb as bandwidth selector are used:

```{r kde}
nb_kde <- naive_bayes(class ~ ., train, usekernel = TRUE)
summary(nb_kde)
get_cond_dist(nb_kde)
nb_kde %prob% test

# Class conditional densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

# Marginal densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "marginal")
```


#### 3.4.1 Changing kernel

In general, there are 7 different smoothing kernels available:

- `gaussian` 
- `epanechnikov`
- `rectangular`
- `triangular`
- `biweight`
- `cosine`
- `optcosine`

and they can be specified in `naive_bayes()` via parameter additional parameter `kernel`. Gaussian kernel is the default smoothing kernel. Please see `density()` and `bw.nrd()` for further details.

```{r kde_kernel}
# Change Gaussian kernel to biweight kernel
nb_kde_biweight <- naive_bayes(class ~ ., train, usekernel = TRUE,
                               kernel = "biweight")
nb_kde_biweight %prob% test
plot(nb_kde_biweight, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
```


#### 3.4.2 Changing bandwidth selector

The `density()` function offers 5 different bandwidth selectors, which can be specified via `bw` parameter:

- `nrd0` (Silverman's rule-of-thumb)
- `nrd` (variation of the rule-of-thumb)
- `ucv` (unbiased cross-validation)
- `bcv` (biased cross-validation)
- `SJ` (Sheather & Jones method)


```{r kde_bw}
nb_kde_SJ <- naive_bayes(class ~ ., train, usekernel = TRUE,
                               bw = "SJ")
nb_kde_SJ %prob% test
plot(nb_kde_SJ, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
```

#### 3.4.3 Adjusting bandwidth

The parameter `adjust` allows to rescale the estimated bandwidth and thus introduces more flexibility to the estimation process. For values below 1 (no rescaling; default setting) the density becomes "wigglier" and for values above 1 the density tends to be "smoother":


```{r kde_adjust}
nb_kde_adjust <- naive_bayes(class ~ ., train, usekernel = TRUE,
                         adjust = 0.5)
nb_kde_adjust %prob% test
plot(nb_kde_adjust, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
```


### 3.5 Model non-negative integers with Poisson distribution

Class conditional distributions of non-negative integer predictors can be modelled with Poisson distribution. This can be achieved by setting `usepoisson=TRUE` in the `naive_bayes()` function and by making sure that the variables representing counts in the dataset are of class `integer`.

```{r poisson}
is.integer(train$count)
nb_pois <- naive_bayes(class ~ ., train, usepoisson = TRUE)
summary(nb_pois)
get_cond_dist(nb_pois)

nb_pois %prob% test

# Class conditional distributions
plot(nb_pois, "count", prob = "conditional")

# Marginal distributions
plot(nb_pois, "count", prob = "marginal")
```

Owner

Name: Michal Majka
Login: majkamichal
Kind: user
Location: Vienna

Website: https://twitter.com/majkamichal
Repositories: 2
Profile: https://github.com/majkamichal

Stats @univienna @DukeU and @ErsteGroup www.stackoverflow.com/users/4956364/michal-majka

GitHub Events

Total

Push event: 1

Last Year

Push event: 1

Committers

Last synced: over 2 years ago

All Time

Total Commits: 236
Total Committers: 1
Avg Commits per committer: 236.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 7
Committers: 1
Avg Commits per committer: 7.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Michal Majka	m**a@h**m	236

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 6
Total pull requests: 3
Average time to close issues: about 2 months
Average time to close pull requests: 28 days
Total issue authors: 6
Total pull request authors: 3
Average comments per issue: 1.33
Average comments per pull request: 0.67
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

kohlkopf (1)
willtownes (1)
markvanderloo (1)
hyaochn (1)
AndreM84 (1)
BharatAD (1)

Pull Request Authors

majkamichal (1)
willtownes (1)
colin-fraser (1)

Top Labels

Issue Labels

bug (3) help wanted (2) wontfix (1)

Pull Request Labels

Packages

Total packages: 2
Total downloads:
- cran 3,618 last-month
Total docker downloads: 21,870

Total dependent packages: 10
(may contain duplicates)
Total dependent repositories: 15
(may contain duplicates)
Total versions: 11
Total maintainers: 1

cran.r-project.org: naivebayes

High Performance Implementation of the Naive Bayes Algorithm

Homepage: https://github.com/majkamichal/naivebayes
Documentation: http://cran.r-project.org/web/packages/naivebayes/naivebayes.pdf
License: GPL-2
Latest release: 1.0.0
published about 2 years ago

Versions: 9
Dependent Packages: 10
Dependent Repositories: 13
Downloads: 3,618 Last month
Docker Downloads: 21,870

Rankings

Dependent packages count: 5.3%

Downloads: 5.5%

Dependent repos count: 8.0%

Stargazers count: 9.0%

Average: 10.5%

Forks count: 10.8%

Docker downloads count: 24.2%

Maintainers (1)

michalmajka@hotmail.com

Last synced: 9 months ago

conda-forge.org: r-naivebayes

Homepage: https://github.com/majkamichal/naivebayes, https://majkamichal.github.io/naivebayes/
License: GPL-2
Latest release: 0.9.7
published about 6 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 2

Rankings

Dependent repos count: 20.2%

Average: 42.5%

Stargazers count: 46.1%

Dependent packages count: 51.6%

Forks count: 52.2%

Last synced: 10 months ago

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

naivebayes

Science Score: 13.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.Rmd

Owner

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

cran.r-project.org: naivebayes

Rankings

Maintainers (1)

conda-forge.org: r-naivebayes

Rankings

Dependencies