BetaML

BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia - Published in JOSS (2021)

https://github.com/sylvaticus/betaml.jl

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    1 of 8 committers (12.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

ai artificial-intelligence autoencoder classification clustering data-science decision-trees deep-learning feature-importance imputation julia machine-learning ml neural-networks pca random-forest regression

Keywords from Contributors

pde normalizing-flow meshing finite-volume fluid-dynamics climate-change correlation standardization ensemble-learning pipelines

Scientific Fields

  • Earth and Environmental Sciences (Physical Sciences) - 40% confidence
  • Psychology (Social Sciences) - 40% confidence

Last synced: 4 months ago

Repository

Beta Machine Learning Toolkit

Basic Info
  • Host: GitHub
  • Owner: sylvaticus
  • License: mit
  • Language: Julia
  • Default Branch: master
  • Homepage:
  • Size: 35.4 MB
Statistics
  • Stars: 100
  • Watchers: 5
  • Forks: 13
  • Open Issues: 6
  • Releases: 38
Topics
ai artificial-intelligence autoencoder classification clustering data-science decision-trees deep-learning feature-importance imputation julia machine-learning ml neural-networks pca random-forest regression
Created over 5 years ago · Last pushed 4 months ago
Metadata Files
Readme License

README.md

Beta Machine Learning Toolkit

Machine Learning made simple :-)

   

The Beta Machine Learning Toolkit is a package including many algorithms and utilities to implement machine learning workflows in Julia, Python, R and any other language with a Julia binding.


Currently the following models are available:

| BetaML name | MLJ Interface | Category |
| ----------- | ------------- | -------- |
| PerceptronClassifier | PerceptronClassifier | Supervised classifier |
| KernelPerceptronClassifier | KernelPerceptronClassifier | Supervised classifier |
| PegasosClassifier | PegasosClassifier | Supervised classifier |
| DecisionTreeEstimator | DecisionTreeClassifier, DecisionTreeRegressor | Supervised regressor and classifier |
| RandomForestEstimator | RandomForestClassifier, RandomForestRegressor | Supervised regressor and classifier |
| NeuralNetworkEstimator | NeuralNetworkRegressor, MultitargetNeuralNetworkRegressor, NeuralNetworkClassifier | Supervised regressor and classifier |
| GaussianMixtureRegressor | GaussianMixtureRegressor, MultitargetGaussianMixtureRegressor | Supervised regressor |
| GaussianMixtureRegressor2 | | Supervised regressor |
| KMeansClusterer | KMeansClusterer | Unsupervised hard clusterer |
| KMedoidsClusterer | KMedoidsClusterer | Unsupervised hard clusterer |
| GaussianMixtureClusterer | GaussianMixtureClusterer | Unsupervised soft clusterer |
| SimpleImputer | SimpleImputer | Unsupervised missing data imputer |
| GaussianMixtureImputer | GaussianMixtureImputer | Unsupervised missing data imputer |
| RandomForestImputer | RandomForestImputer | Unsupervised missing data imputer |
| GeneralImputer | GeneralImputer | Unsupervised missing data imputer |
| MinMaxScaler | | Data transformer |
| StandardScaler | | Data transformer |
| Scaler | | Data transformer |
| PCAEncoder | | Unsupervised dimensionality reduction transformer |
| AutoEncoder | AutoEncoderMLJ | Unsupervised non-linear dimensionality reduction |
| OneHotEncoder | | Data transformer |
| OrdinalEncoder | | Data transformer |
| ConfusionMatrix | | Predictions analysis |
| FeatureRanker | | Predictions analysis |

Theoretical notes describing many of these algorithms are at the companion repository https://github.com/sylvaticus/MITx_6.86x.

All models are implemented entirely in Julia and hosted in this repository (i.e. they are not wrappers of third-party models). If your favourite option or model is missing, you can try to implement it yourself and open a pull request to share it (see the section Contribute below), or request its implementation by opening an issue. Thanks to its JIT compiler, Julia is indeed in the sweet spot where we can easily write models in a high-level language and still have them run efficiently.

Documentation

Please refer to the package documentation or use the Julia inline help system (press the question mark ? and then, at the special help?> prompt, type the module or function name). The package documentation consists of two distinct parts: the first is an extensively commented tutorial that covers most of the library, the second is the reference manual covering the library's API.

If you are looking for introductory material on Julia, have a look at the book "Julia Quick Syntax Reference" (Apress, 2019) or the online course "Scientific Programming and Machine Learning in Julia".

While implemented in Julia, this package can easily be used from R or Python, employing R's JuliaCall or Python's juliacall respectively; see the relevant section in the documentation.
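As a rough illustration of the Python route, the sketch below drives BetaML through the juliacall bridge. It is a hypothetical example, not taken from the BetaML docs: it assumes a working Julia installation with the BetaML.jl package added, plus `pip install juliacall`; the toy data and model choice are arbitrary, and the import is guarded so the script degrades gracefully where Julia is absent.

```python
# Hypothetical sketch: driving BetaML from Python through the juliacall bridge.
# Requires Julia with the BetaML.jl package installed; `pip install juliacall`.
try:
    from juliacall import Main as jl
except ImportError:
    jl = None  # juliacall (and hence Julia) not available in this environment

if jl is not None:
    jl.seval("using BetaML")
    # Build a toy feature matrix and labels inside the embedded Julia session
    jl.seval("X = [4.7 3.2; 6.4 3.2; 6.3 3.3; 5.0 3.4]")
    jl.seval('y = ["setosa","versicolor","virginica","setosa"]')
    jl.seval("m = RandomForestEstimator()")   # any BetaML model works the same way
    jl.seval("ŷ = fit!(m, X, y)")             # fit! returns the in-sample predictions
    print(jl.seval("accuracy(y, ŷ)"))
```

All BetaML calls are kept inside `seval` strings so that Julia-only syntax such as `fit!` needs no name translation on the Python side.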

Examples

  • Using an Artificial Neural Network for multinomial categorisation

In this example we see how to train a neural network model to predict the species name (5th column) given floral sepal and petal measurements (first 4 columns) in the famous iris flower dataset.

```julia
# Load modules
using DelimitedFiles, Random
using Pipe, Plots, BetaML  # Load BetaML and other auxiliary modules
Random.seed!(123);         # Fix the random seed (to obtain reproducible results)

# Load the data
iris = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris.csv"),',',skipstart=1)
x    = convert(Array{Float64,2}, iris[:,1:4])
y    = convert(Array{String,1}, iris[:,5])

# Encode the categories (levels) of y using a separate column per each category (aka "one-hot" encoding)
ohmod = OneHotEncoder()
y_oh  = fit!(ohmod,y)

# Split the data in training/testing sets
((xtrain,xtest),(ytrain,ytest),(ytrain_oh,ytest_oh)) = partition([x,y,y_oh],[0.8,0.2])
(ntrain, ntest) = size.([xtrain,xtest],1)

# Define the Artificial Neural Network model
l1   = DenseLayer(4,10,f=relu)          # The activation function is ReLU
l2   = DenseLayer(10,3)                 # The activation function is identity by default
l3   = VectorFunctionLayer(3,f=softmax) # Add a (parameterless) layer whose activation function (softmax in this case) is defined on all its nodes at once
mynn = NeuralNetworkEstimator(layers=[l1,l2,l3],loss=crossentropy,descr="Multinomial logistic regression Model Sepal",batch_size=2,epochs=200) # Build the NN and use cross-entropy as the error function. Switch to auto-tuning with autotune=true

# Train the model (using the ADAM optimizer by default)
res = fit!(mynn,fit!(Scaler(),xtrain),ytrain_oh) # Fit the model to the (scaled) data

# Obtain predictions and test them against the ground-truth observations
ŷtrain = @pipe predict(mynn,fit!(Scaler(),xtrain)) |> inverse_predict(ohmod,_) # Note the scaling and reverse one-hot encoding functions
ŷtest  = @pipe predict(mynn,fit!(Scaler(),xtest))  |> inverse_predict(ohmod,_)
train_accuracy = accuracy(ytrain,ŷtrain) # 0.975
test_accuracy  = accuracy(ytest,ŷtest)   # 0.96

# Analyse model performance
cm = ConfusionMatrix()
fit!(cm,ytest,ŷtest)
print(cm)
```

```text
A ConfusionMatrix BetaMLModel (fitted)

*** CONFUSION MATRIX ***

Scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"      "virginica"  "versicolor"  "setosa"
 "virginica"   8            1             0
 "versicolor"  0           14             0
 "setosa"      0            0             7

Normalised scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"      "virginica"  "versicolor"  "setosa"
 "virginica"   0.888889     0.111111      0.0
 "versicolor"  0.0          1.0           0.0
 "setosa"      0.0          0.0           1.0

*** CONFUSION REPORT ***

- Accuracy:               0.9666666666666667
- Misclassification rate: 0.033333333333333326
- Number of classes:      3

  N Class       precision  recall  specificity  f1score  actual_count  predicted_count
                            TPR     TNR                            support
  1 virginica       1.000   0.889        1.000    0.941            9                8
  2 versicolor      0.933   1.000        0.938    0.966           14               15
  3 setosa          1.000   1.000        1.000    1.000            7                7

- Simple   avg.     0.978   0.963        0.979    0.969
- Weighted avg.     0.969   0.967        0.971    0.966
```
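The per-class scores in the report above follow the standard precision/recall definitions (precision = TP over the predicted count, recall = TP over the actual count). As a cross-check, independent of BetaML, this small plain-Python sketch recomputes them from the raw counts shown:

```python
# Recompute the confusion-report scores from the raw counts above.
# Rows = actual, columns = predicted; class order: virginica, versicolor, setosa.
labels = ["virginica", "versicolor", "setosa"]
cm = [
    [8,  1, 0],   # actual virginica
    [0, 14, 0],   # actual versicolor
    [0,  0, 7],   # actual setosa
]

accuracy = sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))

precision, recall, f1 = {}, {}, {}
for i, lab in enumerate(labels):
    tp        = cm[i][i]
    predicted = sum(cm[r][i] for r in range(3))  # column sum
    actual    = sum(cm[i])                       # row sum
    precision[lab] = tp / predicted
    recall[lab]    = tp / actual
    f1[lab] = 2 * precision[lab] * recall[lab] / (precision[lab] + recall[lab])

print(round(accuracy, 4))                # 0.9667
print(round(precision["versicolor"], 3)) # 0.933
print(round(recall["virginica"], 3))     # 0.889
```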

```julia
# Plot the training error per epoch and the confusion matrix
ϵ = info(mynn)["loss_per_epoch"]
plot(1:length(ϵ),ϵ, xlabel="epochs",ylabel="error",legend=nothing,title="Avg. error per epoch on the Sepal dataset")
heatmap(info(cm)["categories"],info(cm)["categories"],info(cm)["normalised_scores"],c=cgrad([:white,:blue]),xlabel="Predicted",ylabel="Actual",title="Confusion Matrix")
```

  • Other examples

Further examples, with more models and more advanced techniques to improve predictions, are provided in the documentation tutorial. Basic examples in Python and R are given here. Very "micro" examples of usage of the various functions can also be studied in the unit tests available in the test folder.
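For readers new to the encoding step used in the iris example: one-hot encoding maps each categorical level to its own 0/1 column, and predictions are mapped back to labels by taking the column with the highest score. The function names in this plain-Python sketch are illustrative only, not BetaML's API:

```python
# Plain-Python sketch of one-hot encoding and its inverse, as a concept demo.
def one_hot(y):
    """Encode each value of y as a 0/1 row over the sorted set of levels."""
    levels = sorted(set(y))
    return [[1 if v == lev else 0 for lev in levels] for v in y], levels

def inverse_one_hot(rows, levels):
    """Map each row back to the level of its highest-scoring column."""
    return [levels[max(range(len(r)), key=r.__getitem__)] for r in rows]

y = ["setosa", "versicolor", "setosa", "virginica"]
encoded, levels = one_hot(y)
print(encoded[1])                             # [0, 1, 0]  (the versicolor row)
print(inverse_one_hot(encoded, levels) == y)  # True
```

The inverse step also works on soft scores (e.g. softmax outputs), which is how class probabilities become label predictions.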

Limitations and alternative packages

The focus of the library is skewed toward user-friendliness rather than computational efficiency. While the code is (relatively) easy to read, it is not heavily optimised; currently all models run on the CPU and only with data that fits in the machine's memory. For very large datasets we suggest the specialised packages listed below:

| Category | Packages |
| -------- | -------- |
| ML toolkits/pipelines | ScikitLearn.jl, AutoMLPipeline.jl, MLJ.jl |
| Neural Networks | Flux.jl, Knet |
| Decision Trees | DecisionTree.jl |
| Clustering | Clustering.jl, GaussianMixtures.jl |
| Missing imputation | Impute.jl, Mice.jl |
| Variable importance | ShapML.jl |

TODO

Short term

  • Implement autotuning of GaussianMixtureClusterer using BIC or AIC
  • Add Silhouette method to check cluster validity
  • Implement PAM and/or variants for kmedoids

Mid/Long term

  • Add RNN support and improve convolutional layers speed
  • Reinforcement learning (Markov decision processes)
  • Standardize data sampling in training
  • Add GPU

Contribute

Contributions to the library are welcome. We are particularly interested in the areas covered in the "TODO" list above, but we are open to other areas as well. Please however consider that the focus is mostly didactic/research, so clear, easy-to-read (and well-documented) code and a simple API with reasonable defaults are more important than highly optimised algorithms. For the same reason, it is fine to use verbose names. Please open an issue to discuss your ideas, or directly make a well-documented pull request to the repository. While not required by any means, if you are customising BetaML and writing, for example, your own neural network layer type (by subclassing AbstractLayer), your own sampler (by subclassing AbstractDataSampler) or your own mixture component (by subclassing AbstractMixture), please consider giving it back to the community by opening a pull request to integrate it in BetaML.

Citations

If you use BetaML please cite it as:

  • Lobianco, A., (2021). BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia. Journal of Open Source Software, 6(60), 2849, https://doi.org/10.21105/joss.02849

Bibtex:

```bibtex
@article{Lobianco2021,
  doi       = {10.21105/joss.02849},
  url       = {https://doi.org/10.21105/joss.02849},
  year      = {2021},
  publisher = {The Open Journal},
  volume    = {6},
  number    = {60},
  pages     = {2849},
  author    = {Antonello Lobianco},
  title     = {BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia},
  journal   = {Journal of Open Source Software}
}
```

Acknowledgements

The development of this package at the Bureau d'Economie Théorique et Appliquée (BETA, Nancy) was supported by the French National Research Agency through the Laboratory of Excellence ARBRE, a part of the “Investissements d'Avenir” Program (ANR 11 – LABX-0002-01).


Owner

  • Name: Antonello Lobianco
  • Login: sylvaticus
  • Kind: user
  • Location: Nancy, France
  • Company: AgroParisTech

Research Engineer in Forest Economics at AgroParisTech / Bureau d'Economie Théorique et Appliquée

JOSS Publication

BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia
Published
April 30, 2021
Volume 6, Issue 60, Page 2849
Authors
Antonello Lobianco ORCID
Université de Lorraine, Université de Strasbourg, Institut des sciences et industries du vivant et de l'environnement (AgroParisTech), Centre national de la recherche scientifique (CNRS), Institut national de recherche pour l’agriculture, l’alimentation et l’environnement (INRAE), Bureau d'économie théorique et appliquée (BETA)
Editor
Yuan Tang ORCID
Tags
machine learning neural networks deep learning clustering decision trees random forest perceptron data science

GitHub Events

Total
  • Create event: 3
  • Commit comment event: 2
  • Release event: 1
  • Issues event: 2
  • Watch event: 8
  • Issue comment event: 4
  • Push event: 14
  • Pull request event: 4
Last Year
  • Create event: 3
  • Commit comment event: 2
  • Release event: 1
  • Issues event: 2
  • Watch event: 8
  • Issue comment event: 4
  • Push event: 14
  • Pull request event: 4

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 619
  • Total Committers: 8
  • Avg Commits per committer: 77.375
  • Development Distribution Score (DDS): 0.053
Past Year
  • Commits: 11
  • Committers: 2
  • Avg Commits per committer: 5.5
  • Development Distribution Score (DDS): 0.182
Top Committers
Name Email Commits
Antonello Lobianco a****o@l****g 586
github-actions[bot] 4****] 9
Roland Schätzle 8****A 7
CompatHelper Julia c****y@j****g 7
Páll Haraldsson P****n@g****m 6
Antonello Lobianco l****o@l****g 2
Rik Huijzer t****r@r****l 1
Arfon Smith a****n 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 44
  • Total pull requests: 34
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 25 days
  • Total issue authors: 13
  • Total pull request authors: 5
  • Average comments per issue: 3.82
  • Average comments per pull request: 1.44
  • Merged pull requests: 26
  • Bot issues: 0
  • Bot pull requests: 24
Past Year
  • Issues: 1
  • Pull requests: 4
  • Average time to close issues: 24 days
  • Average time to close pull requests: 3 months
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 4
Top Authors
Issue Authors
  • ablaom (20)
  • sylvaticus (8)
  • PallHaraldsson (2)
  • ParadaCarleton (2)
  • juliohm (2)
  • vincent-picaud (1)
  • mlesnoff (1)
  • simonsteiger (1)
  • CasBex (1)
  • rubsc (1)
  • ConnectedSystems (1)
  • JuliaTagBot (1)
  • logankilpatrick (1)
Pull Request Authors
  • github-actions[bot] (25)
  • roland-KA (5)
  • PallHaraldsson (4)
  • arfon (1)
  • rikhuijzer (1)
Top Labels
Issue Labels
bug (2) question (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • julia 43 total
  • Total dependent packages: 1
  • Total dependent repositories: 0
  • Total versions: 40
juliahub.com: BetaML

Beta Machine Learning Toolkit

  • Versions: 40
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 43 Total
Rankings
Stargazers count: 9.6%
Dependent repos count: 9.9%
Forks count: 11.7%
Average: 17.5%
Dependent packages count: 38.9%
Last synced: 4 months ago

Dependencies

.github/workflows/TagBot.yml actions
  • JuliaRegistries/TagBot v1 composite
.github/workflows/binder.yaml actions
  • jupyterhub/repo2docker-action master composite
.github/workflows/ci-nightly.yml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • codecov/codecov-action v1 composite
  • julia-actions/julia-buildpkg v1 composite
  • julia-actions/julia-processcoverage v1 composite
  • julia-actions/julia-runtest v1 composite
  • julia-actions/setup-julia v1 composite
.github/workflows/ci.yml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • codecov/codecov-action v1 composite
  • julia-actions/julia-buildpkg v1 composite
  • julia-actions/julia-processcoverage v1 composite
  • julia-actions/julia-runtest v1 composite
  • julia-actions/setup-julia v1 composite
.github/workflows/CompatHelper.yml actions