Category Encoders

Category Encoders: a scikit-learn-contrib package of transformers for encoding categorical data - Published in JOSS (2018)

https://github.com/scikit-learn-contrib/category_encoders

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to arxiv.org, springer.com, zenodo.org
✓ Committers with academic emails: 3 of 69 committers (4.3%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.4%) to scientific vocabulary

Scientific Fields

Economics Social Sciences - 63% confidence

Engineering Computer Science - 60% confidence

Artificial Intelligence and Machine Learning Computer Science - 44% confidence

Last synced: 6 months ago

Repository

A library of sklearn compatible categorical variable encoders

Basic Info

Host: GitHub
Owner: scikit-learn-contrib
License: bsd-3-clause
Language: Python
Default Branch: master
Homepage: http://contrib.scikit-learn.org/category_encoders/
Size: 43.2 MB

Statistics

Stars: 2,459
Watchers: 38
Forks: 404
Open Issues: 44
Releases: 22

Created about 10 years ago · Last pushed 8 months ago

Metadata Files

Readme Changelog Contributing License Code of conduct

Categorical Encoding Methods

A set of scikit-learn-style transformers for encoding categorical variables as numeric values using a range of techniques.

Important Links

Documentation: http://contrib.scikit-learn.org/category_encoders/

Encoding Methods

Unsupervised:

* Backward Difference Contrast [2][3]
* BaseN [6]
* Binary [5]
* Gray [14]
* Count [10]
* Hashing [1]
* Helmert Contrast [2][3]
* Ordinal [2][3]
* One-Hot [2][3]
* Rank Hot [15]
* Polynomial Contrast [2][3]
* Sum Contrast [2][3]
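To give a feel for how two of the unsupervised schemes relate, here is a minimal pure-Python sketch of ordinal and binary encoding. It is an illustration under simplifying assumptions, not the library's implementation: the helper names `ordinal_codes` and `binary_encode` are hypothetical, and unknown or missing values are not handled.

```python
def ordinal_codes(values):
    """Map each distinct category to an integer, in order of first appearance (starting at 1)."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping


def binary_encode(values):
    """Encode each category as the binary digits of its ordinal code (hypothetical helper)."""
    mapping = ordinal_codes(values)
    # Number of bit columns needed to represent the largest ordinal code.
    width = max(mapping.values()).bit_length()
    return [
        [(mapping[v] >> shift) & 1 for shift in range(width - 1, -1, -1)]
        for v in values
    ]


colors = ["red", "green", "blue", "green", "red"]
print(ordinal_codes(colors))  # {'red': 1, 'green': 2, 'blue': 3}
print(binary_encode(colors))  # [[0, 1], [1, 0], [1, 1], [1, 0], [0, 1]]
```

In this sketch three categories fit into two columns, whereas one-hot encoding would need three; the gap widens as cardinality grows.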

Supervised:

* CatBoost [11]
* Generalized Linear Mixed Model [12]
* James-Stein Estimator [9]
* LeaveOneOut [4]
* M-estimator [7]
* Target Encoding [7]
* Weight of Evidence [8]
* Quantile Encoder [13]
* Summary Encoder [13]
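The supervised encoders all replace a category with some statistic of the target. As a rough sketch of the idea, the snippet below smooths each category's target mean toward the global mean with an additive weight `m`, in the spirit of an M-estimate. The function names and the exact smoothing formula are assumptions made for illustration; the library's encoders may differ in detail.

```python
from collections import defaultdict


def fit_m_estimate(categories, targets, m=1.0):
    """Learn a smoothed target mean per category (illustrative sketch)."""
    prior = sum(targets) / len(targets)  # global target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # Larger m pulls rare categories more strongly toward the prior.
    encoding = {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}
    return encoding, prior


def apply_encoding(categories, encoding, prior):
    """Replace categories by their learned value; unseen categories get the prior."""
    return [encoding.get(c, prior) for c in categories]


cities = ["a", "a", "a", "b"]
y = [1, 0, 1, 0]
encoding, prior = fit_m_estimate(cities, y, m=1.0)
print(encoding)                                          # {'a': 0.625, 'b': 0.25}
print(apply_encoding(["a", "b", "c"], encoding, prior))  # [0.625, 0.25, 0.5]
```

Category `b` appears only once, so its raw mean of 0 is pulled halfway toward the global mean of 0.5.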

Installation

The package requires numpy, pandas, patsy, scikit-learn, scipy, and statsmodels.

To install the package from a source checkout, execute:

```shell
python setup.py install
```

or install it with pip:

```shell
pip install category_encoders
```

or with conda:

```shell
conda install -c conda-forge category_encoders
```

To install the development version, you may use:

```shell
pip install --upgrade git+https://github.com/scikit-learn-contrib/category_encoders
```

Usage

All of the encoders are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the cols parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.

Examples

There are two types of encoders: unsupervised and supervised. An unsupervised example:

```python
from category_encoders import *
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release.
from sklearn.datasets import load_boston

# prepare some data
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# use binary encoding to encode two categorical features
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)

# transform the dataset
numeric_dataset = enc.transform(X)
```

And a supervised example:

```python
from category_encoders import *
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release.
from sklearn.datasets import load_boston

# prepare some data
bunch = load_boston()
y_train = bunch.target[0:250]
y_test = bunch.target[250:506]
X_train = pd.DataFrame(bunch.data[0:250], columns=bunch.feature_names)
X_test = pd.DataFrame(bunch.data[250:506], columns=bunch.feature_names)

# use target encoding to encode two categorical features
enc = TargetEncoder(cols=['CHAS', 'RAD'])

# transform the datasets
training_numeric_dataset = enc.fit_transform(X_train, y_train)
testing_numeric_dataset = enc.transform(X_test)
```

When transforming the training data with a supervised method, use fit_transform() rather than fit().transform(), because the two are not guaranteed to produce the same result. The difference is visible with the LeaveOneOut encoder: fit_transform() performs a nested cross-validation on the training data (to reduce over-fitting of the downstream model), whereas transform() uses all of the training data for scoring (to get estimates that are as accurate as possible).
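The following pure-Python sketch shows why the two call paths can disagree for a leave-one-out style encoder. It is a simplified illustration, not the library's code: the class `LeaveOneOutSketch` is hypothetical, and the fallback to the global mean for singleton or unseen categories is an assumption of this sketch.

```python
from collections import defaultdict


class LeaveOneOutSketch:
    """Toy leave-one-out target encoder for a single categorical column."""

    def fit(self, categories, targets):
        self.sums_ = defaultdict(float)
        self.counts_ = defaultdict(int)
        for c, y in zip(categories, targets):
            self.sums_[c] += y
            self.counts_[c] += 1
        self.prior_ = sum(targets) / len(targets)
        return self

    def transform(self, categories):
        # Scoring path: use every training row of the category.
        return [
            self.sums_[c] / self.counts_[c] if c in self.counts_ else self.prior_
            for c in categories
        ]

    def fit_transform(self, categories, targets):
        # Training path: exclude each row's own target from its category mean.
        self.fit(categories, targets)
        encoded = []
        for c, y in zip(categories, targets):
            n = self.counts_[c]
            encoded.append((self.sums_[c] - y) / (n - 1) if n > 1 else self.prior_)
        return encoded


cats = ["a", "a", "a", "b", "b"]
y = [1, 0, 1, 0, 1]

train_encoded = LeaveOneOutSketch().fit_transform(cats, y)
refit_encoded = LeaveOneOutSketch().fit(cats, y).transform(cats)
print(train_encoded)  # [0.5, 1.0, 0.5, 1.0, 0.0]
print(refit_encoded)  # category means: 2/3 for 'a', 0.5 for 'b'
```

With fit().transform() every training row sees its own target through the category mean, which is the leakage that the fit_transform() path avoids.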

Furthermore, you may benefit from the following wrappers:

* PolynomialWrapper, which extends supervised encoders to support polynomial targets
* NestedCVWrapper, which helps to prevent overfitting
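The idea behind cross-validated encoding can be sketched in plain Python: each row is encoded using target statistics computed only from the other folds. This is an illustration of the general out-of-fold technique, not the NestedCVWrapper implementation; the function name `out_of_fold_mean_encode` and the interleaved fold assignment are assumptions of the sketch.

```python
def out_of_fold_mean_encode(categories, targets, n_folds=2):
    """Encode each row with the category's target mean from the other folds.

    Assumes n_folds >= 2 and fewer folds than rows. Folds are assigned by
    row index modulo n_folds; a real setup would usually shuffle first.
    """
    n = len(categories)
    encoded = [None] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        sums, counts = {}, {}
        for i in train:
            c = categories[i]
            sums[c] = sums.get(c, 0.0) + targets[i]
            counts[c] = counts.get(c, 0) + 1
        # Fallback for categories that never occur in the training folds.
        fold_prior = sum(targets[i] for i in train) / len(train)
        for i in held_out:
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if c in counts else fold_prior
    return encoded


cats = ["a", "a", "b", "b", "a", "b"]
y = [1, 0, 1, 1, 0, 0]
oof = out_of_fold_mean_encode(cats, y, n_folds=2)
print(oof)  # [0.0, 0.5, 0.5, 1.0, 0.0, 1.0]
```

No row's encoded value depends on its own target, which is what limits target leakage into the downstream model.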

Additional examples and benchmarks can be found in the examples directory.

Contributing

Category Encoders is under active development. If you'd like to be involved, we'd love to have you. Check out the CONTRIBUTING.md file or open an issue on the GitHub project to get started.

References

1. Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
3. Gregory Carey (2003). Coding Categorical Variables. From http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
4. Owen Zhang. Leave One Out Encoding. From https://datascience.stackexchange.com/questions/10839/what-is-difference-between-one-hot-encoding-and-leave-one-out-encoding
5. Beyond One-Hot: an exploration of categorical variables. From https://mcginniscommawill.com/posts/2015-11-29-beyond-one-hot-an-exploration-of-categorical-variables/
6. BaseN Encoding and Grid Search in categorical variables. From https://mcginniscommawill.com/posts/2016-12-18-basen-encoding-grid-search-category-encoders/
7. Daniele Micci-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From http://dx.doi.org/10.1145/507533.507538
8. Weight of Evidence (WOE) and Information Value Explained. From https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
9. Empirical Bayes for multiple sample sizes. From http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/
10. Simple Count or Frequency Encoding. From https://www.datacamp.com/community/tutorials/encoding-methodologies
11. Transforming categorical features to numerical features. From https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/
12. Andrew Gelman and Jennifer Hill (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. From https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf
13. Carlos Mougan, David Masip, Jordi Nin and Oriol Pujol (2021). Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. Modeling Decisions for Artificial Intelligence, 2021. Springer International Publishing. From https://link.springer.com/chapter/10.1007%2F978-3-030-85529-1_14
14. Gray Encoding. From https://en.wikipedia.org/wiki/Gray_code
15. Jacob Buckman, Aurko Roy, Colin Raffel, Ian Goodfellow. Thermometer Encoding: One Hot Way To Resist Adversarial Examples. From https://openreview.net/forum?id=S18Su--CW
16. Carlos Mougan, Jose Alvarez, Salvatore Ruggieri, and Steffen Staab. Fairness implications of encoding protected categorical attributes. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’23. From https://arxiv.org/abs/2201.11358

Owner

Name: scikit-learn-contrib
Login: scikit-learn-contrib
Kind: organization

Website: http://contrib.scikit-learn.org
Repositories: 27
Profile: https://github.com/scikit-learn-contrib

scikit-learn compatible projects

GitHub Events

Total

Create event: 5
Release event: 3
Issues event: 17
Watch event: 60
Issue comment event: 15
Push event: 9
Pull request event: 12
Fork event: 12

Last Year

Create event: 5
Release event: 3
Issues event: 17
Watch event: 60
Issue comment event: 15
Push event: 9
Pull request event: 12
Fork event: 12

Committers

Last synced: 7 months ago

All Time

Total Commits: 816
Total Committers: 69
Avg Commits per committer: 11.826
Development Distribution Score (DDS): 0.814

Past Year

Commits: 18
Committers: 3
Avg Commits per committer: 6.0
Development Distribution Score (DDS): 0.111

Top Committers

Name	Email	Commits
Jan Motl	j**n@m**s	152
jcastaldo08	j**8@g**m	100
Will McGinnis	w**l@p**m	96
paul	p**r@w**e	63
SLLiu	s**6@1**m	60
florian	d**n@a**h	34
Will McGinnis	w**l@p**m	32
Carlos Mougan	c**n@g**m	26
Lisa	l**l@g**m	24
Ben Reiniger	4****r	20
PaulWestenthanner	p**l@w**v	17
florian	c**t@a**h	16
joshua.dunn	j**n@e**m	15
makrobios	b**h@g**m	15
JaimeArboleda	j**a@g**m	10
Will McGinnis	w****6	10
bkhant1	b**n@g**m	9
anjum	a**8@g**m	7
Gleb Levitski	3****v	7
hhy	h**y@1**m	7
Nicholas Bollweg	n**g@g**m	6
david26694	d**4@g**m	6
taowenwu	7**9@q**m	6
Chapman Siu	c**u@g**m	5
Mavs	m**7@g**m	5
Rishoban	r**7@g**m	5
Cameron Davison	c**n@n**m	4
John Hopfensperger	4****h	4
Gijsbers	p**s@t**l	4
jona.sassenhagen	j**n@d**m	3
and 39 more...

Committer Domains (Top 20 + Academic)

againstthecurrent.ch: 2
motl.us: 1
predikto.com: 1
163.com: 1
pedalwrencher.com: 1
westenthanner.dev: 1
engie.com: 1
126.com: 1
qq.com: 1
novilabs.com: 1
tue.nl: 1
datarobot.com: 1
contractors.roche.com: 1
hotmaul.co.uk: 1
lesara.de: 1
iqgateway.com: 1
suncorp.com.au: 1
ly.st: 1
oecu.jp: 1
student.ethz.ch: 1
columbia.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 101
Total pull requests: 67
Average time to close issues: over 1 year
Average time to close pull requests: 4 months
Total issue authors: 84
Total pull request authors: 28
Average comments per issue: 3.53
Average comments per pull request: 1.81
Merged pull requests: 50
Bot issues: 0
Bot pull requests: 4

Past Year

Issues: 9
Pull requests: 16
Average time to close issues: 28 days
Average time to close pull requests: 11 days
Issue authors: 7
Pull request authors: 5
Average comments per issue: 0.78
Average comments per pull request: 0.5
Merged pull requests: 9
Bot issues: 0
Bot pull requests: 4

Top Authors

Issue Authors

PaulWestenthanner (8)
janmotl (4)
bmreiniger (3)
wdm0006 (2)
JoshuaC3 (2)
eddietaylor (2)
willsthompson (2)
tvdboom (2)
DZIMDZEM (1)
CoteDave (1)
PraveshKoirala (1)
TobiasSackmannDacoso (1)
euisuk-chung (1)
nexusme (1)
iuiu34 (1)

Pull Request Authors

PaulWestenthanner (25)
glevv (4)
dependabot[bot] (4)
bmreiniger (4)
fullflu (2)
bkhant1 (2)
nercisla (2)
Jordanbarker (2)
marekschneider (2)
s-banach (2)
tvdboom (2)
bollwyvl (2)
ederson-souza (1)
woodly0 (1)
dennisobrien (1)

Top Labels

Issue Labels

enhancement (22)
good first issue (8)
bug (8)
help wanted (7)
non-reproducible (6)
question (3)
discussion (3)
wontfix (2)
documentation (2)

Pull Request Labels

dependencies (4)
python (4)

Packages

Total packages: 3
Total downloads:
- pypi 29 last-month

Total dependent packages: 9
(may contain duplicates)
Total dependent repositories: 25
(may contain duplicates)
Total versions: 25
Total maintainers: 1

conda-forge.org: category_encoders

A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques. While ordinal, one-hot, and hashing encoders have similar equivalents in the existing scikit-learn version, the transformers in this library all share a few useful properties:

- First-class support for pandas dataframes as an input (and optionally as output)
- Can explicitly configure which columns in the data are encoded by name or index, or infer non-numeric columns regardless of input type
- Can optionally drop any columns with very low variance, based on the training set
- Portability: train a transformer on data, pickle it, reuse it later and get the same thing out
- Full compatibility with sklearn pipelines: input an array-like dataset like any other transformer

Homepage: https://github.com/scikit-learn-contrib/category_encoders
License: BSD-3-Clause
Latest release: 2.5.0
published over 3 years ago

Versions: 16
Dependent Packages: 7
Dependent Repositories: 12

Rankings

Dependent packages count: 8.0%

Forks count: 8.5%

Stargazers count: 8.5%

Average: 8.8%

Dependent repos count: 10.1%

Last synced: 6 months ago

pypi.org: category-encoders-dev

A collection of sklearn transformers to encode categorical variables as numeric

Homepage: https://github.com/scikit-learn-contrib/category_encoders
Documentation: https://category-encoders-dev.readthedocs.io/
License: BSD
Latest release: 2.2.2.post2021
published over 4 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 29 Last month

Rankings

Stargazers count: 1.5%

Forks count: 2.7%

Dependent packages count: 7.3%

Average: 11.5%

Dependent repos count: 22.1%

Downloads: 24.1%

Maintainers (1)

minrock

Last synced: 6 months ago

anaconda.org: category_encoders

Homepage: https://github.com/scikit-learn-contrib/category_encoders
License: BSD-3-Clause
Latest release: 2.8.1
published 7 months ago

Versions: 8
Dependent Packages: 2
Dependent Repositories: 12

Rankings

Stargazers count: 17.1%

Forks count: 17.1%

Dependent packages count: 20.4%

Average: 22.8%

Dependent repos count: 36.7%

Last synced: 6 months ago

Dependencies

.github/workflows/docs.yml actions

actions/checkout v2 composite
ammaraskar/sphinx-action master composite
peaceiris/actions-gh-pages v3 composite

.github/workflows/pypi-publish.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
pypa/gh-action-pypi-publish master composite

.github/workflows/test-docs-build.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
ammaraskar/sphinx-action master composite

.github/workflows/test-suite.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite

docs/requirements.txt pypi

numpy >=1.14.0
numpydoc *
pandas >=0.21.1
patsy >=0.5.1
scikit-learn >=0.20.0
scipy >=1.0.0
sphinx >=3.0
sphinx_rtd_theme *
statsmodels >=0.9.0
unittest2 *

requirements-dev.txt pypi

numpydoc * development
pytest * development
sphinx * development
sphinx_rtd_theme * development

requirements.txt pypi

numpy >=1.14.0
pandas >=1.0.5
patsy >=0.5.1
scikit-learn >=1.0.0
scipy >=1.0.0
statsmodels >=0.9.0
unittest2 *

setup.py pypi

numpy >=1.14.0
pandas >=1.0.5
patsy >=0.5.1
scikit-learn >=0.20.0
scipy >=1.0.0
statsmodels >=0.9.0
