Imbalance

Imbalance: A comprehensive multi-interface Julia toolbox to address class imbalance - Published in JOSS (2024)

https://github.com/juliaai/imbalance.jl

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 7 DOI reference(s) in README and JOSS metadata
✓ Academic publication links: Links to joss.theoj.org
○ Committers with academic emails
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords

class-imbalance classification machine-learning

Scientific Fields

Earth and Environmental Sciences Physical Sciences - 40% confidence

Last synced: 6 months ago · JSON representation

Repository

A Julia toolbox with resampling methods to correct for class imbalance.

Basic Info

Host: GitHub
Owner: JuliaAI
License: mit
Language: Julia
Default Branch: dev
Homepage: https://juliaai.github.io/Imbalance.jl/dev/
Size: 43.5 MB

Statistics

Stars: 29
Watchers: 4
Forks: 1
Open Issues: 2
Releases: 7

Topics

class-imbalance classification machine-learning

Created over 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License

Imbalance.jl

Imbalance

A Julia package with resampling methods to correct for class imbalance in a wide variety of classification settings.

⏬ Installation

`import Pkg; Pkg.add("Imbalance")`

✨ Implemented Methods

The package implements the following resampling algorithms:

Random Oversampling
Random Walk Oversampling (RWO)
Random Oversampling Examples (ROSE)
Synthetic Minority Oversampling Technique (SMOTE)
Borderline SMOTE1
SMOTE-Nominal (SMOTE-N)
SMOTE-Nominal Categorical (SMOTE-NC)
Random Undersampling
Cluster Undersampling
EditedNearestNeighbors Undersampling
Tomek Links Undersampling
Balanced Bagging Classifier (@MLJBalancing.jl)

To see various examples where such methods help improve classification performance, check the tutorials section of the documentation.

Interested in contributing with more? Check this.

🚀 Quick Start

We will illustrate using the package to oversample with `SMOTE`; however, all other implemented oversampling methods follow the same pattern.

Let's start by generating some dummy imbalanced data:

```julia
using Imbalance

# Set dataset properties then generate imbalanced data
class_probs = [0.5, 0.2, 0.3]                  # probability of each class
num_rows, num_continuous_feats = 100, 5
X, y = generate_imbalanced_data(num_rows, num_continuous_feats; class_probs, rng=42)
```

In the following code blocks, it will be assumed that `X` and `y` are readily available.

🔵 Standard API

All methods by default support a pure functional interface.

```julia
using Imbalance

# Apply SMOTE to oversample the classes
Xover, yover = smote(X, y; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
```
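To make the `ratios` argument more concrete, here is a minimal pure-Julia sketch of how per-class ratios could translate into a number of points to add. It assumes each ratio is interpreted relative to the size of the majority class; the helper names `class_counts` and `points_to_add`, the rounding, and the clamp at zero are illustrative assumptions and are not part of the package's API.

```julia
# Count observations per class label.
function class_counts(y)
    counts = Dict{eltype(y), Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    return counts
end

# Hypothetical helper: number of points to add per class so that each class
# reaches `ratio * (majority class size)`. Rounding and the clamp at zero
# are assumptions made for this sketch.
function points_to_add(y, ratios::AbstractDict)
    counts = class_counts(y)
    majority = maximum(values(counts))
    return Dict(label => max(0, round(Int, ratios[label] * majority) - counts[label])
                for label in keys(counts))
end

# Made-up labels with class sizes 50, 20 and 30.
y_demo = vcat(fill(0, 50), fill(1, 20), fill(2, 30))
extra = points_to_add(y_demo, Dict(0 => 1.0, 1 => 0.9, 2 => 0.8))
```

With these made-up labels, class `0` is already the majority and needs no new points, while classes `1` and `2` would need 25 and 10 extra points to reach 45 and 40 observations respectively.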

🤖 MLJ Interface

All methods support the MLJ interface. Instead of calling the method directly, one instantiates a model for the method (optionally passing the keyword parameters found in the functional interface), wraps the model in a machine, and then calls `transform` on the machine and data.

```julia
using MLJ

# Load the model
SMOTE = @load SMOTE pkg=Imbalance

# Create an instance of the model
oversampler = SMOTE(k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)

# Wrap it in a machine
mach = machine(oversampler)

# Provide the data to transform
Xover, yover = transform(mach, X, y)
```

All implemented oversampling methods are considered static transforms and hence, no `fit` is required.

Pipelining Models

If MLJBalancing is also used, an arbitrary number of resampling methods from Imbalance.jl can be wrapped with a classification model from MLJ to function as a unified model where resampling automatically takes place on given data before training the model (and is bypassed during prediction).

```julia
using MLJ, MLJBalancing

# grab two resamplers and a classifier
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
SMOTE = @load SMOTE pkg=Imbalance verbosity=0
TomekUndersampler = @load TomekUndersampler pkg=Imbalance verbosity=0

oversampler = SMOTE(k=5, ratios=1.0, rng=42)
undersampler = TomekUndersampler(min_ratios=0.5, rng=42)
logistic_model = LogisticClassifier()

# wrap the oversampler, undersampler and classification model together
balanced_model = BalancedModel(model=logistic_model, balancer1=oversampler, balancer2=undersampler)

# behaves like a single model
mach = machine(balanced_model, X, y);
fit!(mach, verbosity=0)
predict(mach, X)
```
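The idea that resampling happens only at training time and is bypassed at prediction time can be sketched in plain Julia. Everything below is a hypothetical toy (`ToyBalancedModel`, `fit_toy`, `repeat_class_one`, and `train_threshold` are made-up names) and is not how MLJBalancing is implemented; it only illustrates the control flow under the assumption of 1-D features and binary labels `0` and `1`.

```julia
using Statistics

# Hypothetical toy wrapper; NOT the MLJBalancing implementation.
struct ToyBalancedModel{T,B}
    train::T       # (X, y) -> predictor function
    balancers::B   # tuple of resamplers, each (X, y) -> (X, y)
end

# Training: apply every balancer in order, then train on the resampled data.
function fit_toy(model::ToyBalancedModel, X, y)
    for balance in model.balancers
        X, y = balance(X, y)
    end
    return model.train(X, y)
end

# Toy resampler: repeat rows of class 1 until it matches class 0 in size.
# Assumes class 1 is present in `y`.
function repeat_class_one(X, y)
    idx = findall(==(1), y)
    n_missing = count(==(0), y) - length(idx)
    extra = Int[idx[mod1(i, length(idx))] for i in 1:n_missing]
    return vcat(X, X[extra]), vcat(y, y[extra])
end

# Toy classifier: predict class 1 when x exceeds the mean of the training features.
function train_threshold(X, y)
    threshold = mean(X)
    return x -> x > threshold ? 1 : 0
end

X_demo, y_demo = [0.0, 0.0, 0.0, 4.0], [0, 0, 0, 1]
plain = fit_toy(ToyBalancedModel(train_threshold, ()), X_demo, y_demo)
balanced = fit_toy(ToyBalancedModel(train_threshold, (repeat_class_one,)), X_demo, y_demo)

# Prediction: the returned predictor is applied directly; no resampling happens here.
predictions = balanced.([1.5, 3.0])
```

In this toy, the balancer changes the data seen during training (the learned threshold moves from 1.0 to 2.0), while prediction simply calls the trained predictor on new inputs.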

🏓 Table Transforms Interface

The `TableTransforms` interface operates on single tables; it assumes that `y` is one of the columns of the given table. Thus, it follows a similar pattern to the MLJ interface except that the index of `y` is a required argument while instantiating the model and the data to be transformed via `apply` is only one table `Xy`.

```julia
using Imbalance
using Imbalance.TableTransforms
using TableTransforms

# Generate imbalanced data
num_rows = 200
num_features = 5
y_ind = 3
Xy, _ = generate_imbalanced_data(num_rows, num_features; class_probs=[0.5, 0.2, 0.3], insert_y=y_ind, rng=42)

# Initiate SMOTE model
oversampler = SMOTE(y_ind; k=5, ratios=Dict(0=>1.0, 1=> 0.9, 2=>0.8), rng=42)
Xyover = Xy |> oversampler                              # can chain with other table transforms

# equivalently if TableTransforms is used
Xyover, cache = TableTransforms.apply(oversampler, Xy)
```

The `reapply(oversampler, Xy, cache)` method from `TableTransforms` simply falls back to `apply(oversampler, Xy)` and the `revert(oversampler, Xy, cache)` reverts the transform by removing the oversampled observations from the table.

Notice that because the interfaces of MLJ and TableTransforms use the same model names, you will have to specify the source of the model if both are used in the same file (e.g., Imbalance.TableTransforms.SMOTE) for the example above.

🎨 Features

Supports multi-class variants of the algorithms and both nominal and continuous features
Provides MLJ and TableTransforms interfaces aside from the default pure functional interface
Generic by supporting table input/output formats as well as matrices
Supports tables regardless of whether the target is a separate column or one of the columns
Supports automatic encoding and decoding of nominal features

📜 Rationale

Most if not all machine learning algorithms can be viewed as a form of empirical risk minimization where the objective is to find the parameters $\theta$ that, for some loss function $L$, minimize the empirical risk:

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(f_{\theta}(x_i), y_i)$$

The underlying assumption is that minimizing this empirical risk corresponds to approximately minimizing the true risk, which considers all examples in the population. This would imply that $f_\theta$ is approximately the true target function $f$ that we seek to model.

In a multi-class setting with $K$ classes, one can write

$$\hat{\theta} = \arg\min_{\theta} \left( \frac{1}{N_1} \sum_{i \in C_1} L(f_{\theta}(x_i), y_i) + \frac{1}{N_2} \sum_{i \in C_2} L(f_{\theta}(x_i), y_i) + \ldots + \frac{1}{N_K} \sum_{i \in C_K} L(f_{\theta}(x_i), y_i) \right)$$

Class imbalance occurs when some classes have far fewer examples than other classes. In this case, the terms corresponding to smaller classes contribute minimally to the sum, which makes it possible for a learning algorithm to find an approximate solution to minimizing the empirical risk that mostly minimizes only the significant sums. This yields a hypothesis $f_\theta$ that may be very different from the true target $f$ with respect to the minority classes, which may be the most important for the application in question.

One obvious possible remedy is to weight the smaller sums so that a learning algorithm more easily avoids approximate solutions that exploit their insignificance, which can be seen to be equivalent to repeating observations in the minority classes. This can be achieved by naive random oversampling, which is offered by this package along with other more advanced resampling methods that function by generating synthetic data or deleting existing observations. You can read more about the class imbalance problem and learn about various algorithms implemented in this package by reading this series of articles on Medium.
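The equivalence between weighting and repetition can be checked numerically. The following is a small sketch with made-up per-example losses and integer class weights (both are assumptions chosen for illustration, not values produced by the package): the weighted average loss matches the plain average loss over a dataset in which each example is repeated as many times as its class weight.

```julia
# Made-up per-example losses; the last example belongs to the minority class.
losses = [0.1, 0.2, 0.1, 0.3, 2.0]
labels = [0, 0, 0, 0, 1]

# Hypothetical integer weight per class: the minority class counts 4 times.
weights = Dict(0 => 1, 1 => 4)

# Plain empirical risk: the minority example is only one of five terms.
plain_risk = sum(losses) / length(losses)

# (a) Weighted empirical risk.
w = [weights[l] for l in labels]
weighted_risk = sum(w .* losses) / sum(w)

# (b) Plain empirical risk after repeating each example `weights[label]` times.
repeated_losses = reduce(vcat, [fill(loss, weights[l]) for (loss, l) in zip(losses, labels)])
repeated_risk = sum(repeated_losses) / length(repeated_losses)

println(isapprox(weighted_risk, repeated_risk))
```

Both (a) and (b) give the same value, which is noticeably larger than the plain risk because the poorly fitted minority example now carries as much total weight as the majority class.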

To our knowledge, there are no existing maintained Julia packages that implement resampling algorithms for multi-class classification problems or that handle both nominal and continuous features. This has served as a primary motivation for the creation of this package.

👥 Credits

This package was created by Essam Wisam as a Google Summer of Code project, under the mentorship of Anthony Blaom. Special thanks also go to Rik Huijzer for his friendliness and the binary SMOTE implementation in Resample.jl.

You may cite the following paper should you use Imbalance.jl or MLJBalancing.jl in a scientific publication:

@article{Wisam2024,
  doi = {10.21105/joss.06310},
  url = {https://doi.org/10.21105/joss.06310},
  year = {2024},
  publisher = {The Open Journal},
  volume = {9},
  number = {95},
  pages = {6310},
  author = {Essam Wisam and Anthony Blaom},
  title = {Imbalance: A comprehensive multi-interface Julia toolbox to address class imbalance},
  journal = {Journal of Open Source Software}
}

Owner

Name: JuliaAI
Login: JuliaAI
Kind: organization

Website: https://github.com/alan-turing-institute/MLJ.jl
Repositories: 47
Profile: https://github.com/JuliaAI

Home for repositories of the MLJ (Machine Learning in Julia) project

JOSS Publication

Imbalance: A comprehensive multi-interface Julia toolbox to address class imbalance

Published

March 18, 2024

DOI

10.21105/joss.06310

Volume 9, Issue 95, Page 6310

Authors

Essam Wisam

Cairo University, Egypt

Anthony Blaom

University of Auckland, New Zealand

Editor

Mehmet Hakan Satman

GitHub Events

Total

Issues event: 2
Watch event: 1
Issue comment event: 3
Pull request event: 1

Last Year

Issues event: 2
Watch event: 1
Issue comment event: 3
Pull request event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 315
Total Committers: 3
Avg Commits per committer: 105.0
Development Distribution Score (DDS): 0.022

Past Year

Commits: 2
Committers: 1
Avg Commits per committer: 2.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Essam	e**m@o**m	308
Anthony D. Blaom	a**m@g**m	5
Antonello Lobianco	s****s	2

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 69
Total pull requests: 32
Average time to close issues: 13 days
Average time to close pull requests: 2 days
Total issue authors: 6
Total pull request authors: 4
Average comments per issue: 2.33
Average comments per pull request: 1.44
Merged pull requests: 29
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: about 23 hours
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 3.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

ablaom (50)
EssamWisam (5)
sylvaticus (4)
ArneTillmann (3)
math4mad (1)
JuliaTagBot (1)

Pull Request Authors

EssamWisam (34)
ablaom (5)
sylvaticus (4)
jbytecode (2)

Top Labels

Issue Labels

invalid (1) documentation (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- julia 14 total

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 7

juliahub.com: Imbalance

A Julia toolbox with resampling methods to correct for class imbalance.

Homepage: https://juliaai.github.io/Imbalance.jl/dev/
Documentation: https://docs.juliahub.com/General/Imbalance/stable/
License: MIT
Latest release: 0.1.6
published almost 2 years ago

Versions: 7
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 14 Total

Rankings

Dependent repos count: 10.1%

Dependent packages count: 37.1%

Average: 40.0%

Forks count: 53.7%

Stargazers count: 59.0%

Last synced: 6 months ago

Dependencies

.github/workflows/Documenter.yml actions

actions/checkout v2 composite
julia-actions/setup-julia latest composite

.github/workflows/CompatHelper.yml actions

.github/workflows/TagBot.yml actions

JuliaRegistries/TagBot v1 composite

.github/workflows/CI.yml actions

actions/checkout v2 composite
codecov/codecov-action v2 composite
julia-actions/julia-buildpkg latest composite
julia-actions/julia-processcoverage v1 composite
julia-actions/julia-runtest latest composite
julia-actions/setup-julia latest composite

.github/workflows/formatter.yml actions

actions/checkout v2 composite
peter-evans/create-pull-request v3 composite
