Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org, scholar.google
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.7%) to scientific vocabulary
Keywords from Contributors
tidy-data
tidyverse
documentation-tool
setup
data-manipulation
grammar
shiny
c-plus-plus-11
c-plus-plus-14
c-plus-plus-17
Last synced: 10 months ago
Repository
Extra recipes for predictor embeddings
Basic Info
- Host: GitHub
- Owner: tidymodels
- License: other
- Language: R
- Default Branch: main
- Homepage: https://embed.tidymodels.org
- Size: 15.9 MB
Statistics
- Stars: 143
- Watchers: 11
- Forks: 19
- Open Issues: 25
- Releases: 15
Created about 8 years ago · Last pushed 10 months ago
Metadata Files
Readme
Changelog
Contributing
License
Code of conduct
README.Rmd
---
output: github_document
---
```{r}
#| echo: false
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# embed
[R-CMD-check](https://github.com/tidymodels/embed/actions/workflows/R-CMD-check.yaml)
[Codecov test coverage](https://app.codecov.io/gh/tidymodels/embed?branch=main)
[CRAN status](https://CRAN.r-project.org/package=embed)

## Introduction
`embed` has extra steps for the [`recipes`](https://recipes.tidymodels.org/) package for embedding predictors into one or more numeric columns. Almost all of the preprocessing methods are _supervised_.
These steps are available here in a separate package because the step dependencies, [`rstanarm`](https://CRAN.r-project.org/package=rstanarm), [`lme4`](https://CRAN.r-project.org/package=lme4), and [`keras3`](https://CRAN.r-project.org/package=keras3), are fairly heavy.
Some steps handle categorical predictors:
* `step_lencode_glm()`, `step_lencode_bayes()`, and `step_lencode_mixed()` estimate the effect of each factor level on the outcome and use these estimates as the new encoding. The estimates come from a generalized linear model, fit either without pooling (via `glm`) or with partial pooling (via `stan_glm` or `lmer`). Currently implemented for numeric and two-class outcomes.
* `step_embed()` uses `keras3::layer_embedding` to translate the original _C_ factor levels into a set of _D_ new variables (< _C_). The model fitting routine optimizes which factor levels are mapped to each of the new variables as well as the corresponding regression coefficients (i.e., neural network weights) that will be used as the new encodings.
* `step_woe()` creates new variables based on weight of evidence encodings.
* `step_feature_hash()` can create indicator variables using feature hashing.
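To make the idea of effect encoding without pooling concrete, here is a minimal base R sketch. It is not the package's implementation: the data frame `dat` and the column names `level`, `y`, and `level_encoded` are made up for illustration, and the only assumption is that a no-intercept `glm()` fit yields one coefficient per factor level.

```r
# Toy data (invented for illustration): a three-level factor and a two-class outcome.
dat <- data.frame(
  level = factor(c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c")),
  y     = c(1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
)

# No-intercept logistic regression: one coefficient per factor level,
# equal to that level's estimated log-odds of y == 1 (no pooling across levels).
fit <- glm(y ~ 0 + level, data = dat, family = binomial)

# Map level names to their estimated log-odds.
encoding <- setNames(coef(fit), levels(dat$level))

# Replace the factor with its numeric encoding.
dat$level_encoded <- unname(encoding[as.character(dat$level)])
```

With partial pooling, the per-level estimates would instead be shrunk toward an overall mean, which tends to matter most for levels with few observations.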
For numeric predictors:
* `step_umap()` uses a nonlinear transformation similar to t-SNE, but the fitted transformation can also be applied to new data. Both supervised and unsupervised methods can be used.
* `step_discretize_xgb()` and `step_discretize_cart()` can make binned versions of numeric predictors using supervised tree-based models.
* `step_pca_sparse()` and `step_pca_sparse_bayes()` conduct feature extraction with sparsity of the component loadings.
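As a rough illustration of supervised binning, the base R sketch below picks a single cut point for a numeric predictor by minimizing the squared error of the outcome within each bin, which is the criterion a regression tree applies at each split. It is a deliberately simplified stand-in (one split, no tree package) rather than what `step_discretize_cart()` or `step_discretize_xgb()` actually run, and the data and the names `best_cut` and `x_binned` are invented for the example.

```r
# Toy data (invented): the outcome jumps when x crosses 5.
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1)

# Candidate cut points are midpoints between consecutive sorted unique values.
xs <- sort(unique(x))
candidates <- (head(xs, -1) + tail(xs, -1)) / 2

# Sum of squared errors around the group mean.
sse <- function(v) sum((v - mean(v))^2)

# Total within-bin SSE for each candidate cut point.
split_sse <- vapply(
  candidates,
  function(cut_point) sse(y[x < cut_point]) + sse(y[x >= cut_point]),
  numeric(1)
)

# The supervised cut point is the one that best separates the outcome.
best_cut <- candidates[which.min(split_sse)]

# Bin the predictor at the chosen cut point; new data would reuse best_cut.
x_binned <- cut(x, breaks = c(-Inf, best_cut, Inf), right = FALSE)
```

A full tree would repeat this search inside each resulting bin, producing several cut points instead of one.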
Some references for these methods are:
* Chollet F and Allaire JJ (2018) [_Deep Learning with R_](https://www.manning.com/books/deep-learning-with-r), Manning
* Guo C and Berkhahn F (2016) "[Entity Embeddings of Categorical Variables](https://arxiv.org/abs/1604.06737)"
* Micci-Barreca D (2001) "[A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=A+preprocessing+scheme+for+high-cardinality+categorical+attributes+in+classification+and+prediction+problems&btnG=)," ACM SIGKDD Explorations Newsletter, 3(1), 27-32.
* Zumel N and Mount J (2017) "[`vtreat`: a `data.frame` Processor for Predictive Modeling](https://arxiv.org/abs/1611.09477)"
* McInnes L and Healy J (2018) "[UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction](https://arxiv.org/abs/1802.03426)"
* Good IJ (1985) "[Weight of evidence: A brief survey](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Weight+of+evidence%3A+A+brief+survey&btnG=)," Bayesian Statistics, 2, 249-270.
## Getting Started
There are two articles that walk through how to use these embedding steps, using [generalized linear models](https://embed.tidymodels.org/articles/Applications/GLM.html) and [neural networks built via TensorFlow](https://embed.tidymodels.org/articles/Applications/Tensorflow.html).
## Installation
To install the package:
```r
install.packages("embed")
```
Note that to use some steps, you will also have to install other packages such as `rstanarm` and `lme4`. For all of the steps to work, you may want to use:
```r
install.packages(c("rpart", "xgboost", "rstanarm", "lme4"))
```
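Before installing everything, it may help to see which of these optional packages are already available. The sketch below uses only base R; the variable names are illustrative, and the package list simply repeats the one from the command above.

```r
# Optional packages named in the installation note above.
optional_pkgs <- c("rpart", "xgboost", "rstanarm", "lme4")

# requireNamespace() returns TRUE when a package can be loaded, without attaching it.
is_installed <- vapply(
  optional_pkgs,
  requireNamespace,
  logical(1),
  quietly = TRUE
)

missing_pkgs <- optional_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install only what is missing:
  # install.packages(missing_pkgs)
}
```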
To get a bug fix or to use a feature from the development version, you can install the development version of this package from GitHub.
```r
# install.packages("pak")
pak::pak("tidymodels/embed")
```
## Contributing
This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on Posit Community](https://forum.posit.co/new-topic?category_id=15&tags=tidymodels,question).
- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/embed/issues).
- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.
- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).
Owner
- Name: tidymodels
- Login: tidymodels
- Kind: organization
- Repositories: 59
- Profile: https://github.com/tidymodels
GitHub Events
Total
- Create event: 11
- Release event: 1
- Issues event: 19
- Watch event: 1
- Delete event: 10
- Issue comment event: 9
- Push event: 34
- Pull request event: 25
- Fork event: 1
Last Year
- Create event: 11
- Release event: 1
- Issues event: 19
- Watch event: 1
- Delete event: 10
- Issue comment event: 9
- Push event: 34
- Pull request event: 25
- Fork event: 1
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Emil Hvitfeldt | e****t@g****m | 325 |
| topepo | m****n@g****m | 248 |
| Julia Silge | j****e@g****m | 21 |
| Hannah Frick | h****h@r****m | 15 |
| Konrad | k****h@g****m | 12 |
| Daniel Falbel | d****l@g****m | 5 |
| simonpcouch | s****h@g****m | 3 |
| James Wade | j****e@d****m | 2 |
| artichaud1 | k****i@h****m | 2 |
| Athospd | a****i@g****m | 1 |
| Cory Brunson | c****d@g****m | 1 |
| DavisVaughan | d****s@r****m | 1 |
| Dirk Eddelbuettel | e****d@d****g | 1 |
| Gábor Csárdi | c****r@g****m | 1 |
| Sean Ingerson | 3****n | 1 |
| Timothy Mastny | t****y@g****m | 1 |
| asiripanich | 1****h | 1 |
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 64
- Total pull requests: 119
- Average time to close issues: 2 months
- Average time to close pull requests: 4 days
- Total issue authors: 16
- Total pull request authors: 7
- Average comments per issue: 1.64
- Average comments per pull request: 0.92
- Merged pull requests: 115
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 18
- Pull requests: 20
- Average time to close issues: about 2 months
- Average time to close pull requests: about 8 hours
- Issue authors: 3
- Pull request authors: 2
- Average comments per issue: 0.11
- Average comments per pull request: 0.05
- Merged pull requests: 16
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- EmilHvitfeldt (38)
- juliasilge (4)
- exsell-jc (2)
- talegari (2)
- naveranoc (2)
- jackobenco016 (2)
- topepo (2)
- hfrick (2)
- AndrewKostandy (1)
- diegoperoni (1)
- mkhansa (1)
- dgrtwo (1)
- jrosell (1)
- wbuchanan (1)
- jlmelville (1)
Pull Request Authors
- EmilHvitfeldt (109)
- topepo (16)
- JamesHWade (2)
- juliasilge (2)
- gaborcsardi (2)
- simonpcouch (2)
- corybrunson (2)
Top Labels
Issue Labels
feature (21)
upkeep (11)
bug (10)
target encoding (5)
documentation (4)
tidy-dev-day :nerd_face: (2)
reprex (1)
good first issue :heart: (1)
Packages
- Total packages: 1
- Total downloads:
  - cran: 1,594 last month
- Total docker downloads: 87
- Total dependent packages: 2
- Total dependent repositories: 11
- Total versions: 21
- Total maintainers: 1
cran.r-project.org: embed
Extra Recipes for Encoding Predictors
- Homepage: https://embed.tidymodels.org
- Documentation: http://cran.r-project.org/web/packages/embed/embed.pdf
- License: MIT + file LICENSE
- Latest release: 1.2.0 (published 10 months ago)
Rankings
Stargazers count: 2.9%
Forks count: 4.6%
Dependent repos count: 8.8%
Average: 11.1%
Downloads: 12.0%
Dependent packages count: 13.7%
Docker downloads count: 24.9%
Maintainers (1)
Last synced: 10 months ago
Dependencies
DESCRIPTION
cran
- R >= 3.4 depends
- recipes >= 1.0.0 depends
- dplyr * imports
- generics >= 0.1.0 imports
- glue * imports
- keras * imports
- lifecycle * imports
- purrr * imports
- rlang >= 0.4.10 imports
- rsample * imports
- stats * imports
- tensorflow * imports
- tibble * imports
- tidyr * imports
- utils * imports
- uwot * imports
- withr * imports
- VBsparsePCA * suggests
- covr * suggests
- ggplot2 * suggests
- irlba * suggests
- knitr * suggests
- lme4 * suggests
- modeldata * suggests
- rmarkdown * suggests
- rpart * suggests
- rstanarm * suggests
- stringdist * suggests
- testthat >= 3.0.0 suggests
- xgboost * suggests