ondil

A package for online distributional learning.

https://github.com/simon-hirsch/ondil

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

distributional-forecasting distributional-regression gam gamlss location-scale-and-shape machine-learning online-learning probabilistic-forecasting python statistical-learning volatility-modeling

Scientific Fields

Economics (Social Sciences) - 60% confidence
Earth and Environmental Sciences (Physical Sciences) - 40% confidence
Last synced: 4 months ago

Repository

A package for online distributional learning.

Basic Info
Statistics
  • Stars: 22
  • Watchers: 2
  • Forks: 6
  • Open Issues: 32
  • Releases: 19
Topics
distributional-forecasting distributional-regression gam gamlss location-scale-and-shape machine-learning online-learning probabilistic-forecasting python statistical-learning volatility-modeling
Created over 1 year ago · Last pushed 4 months ago
Metadata Files
Readme License Citation

README.md

ondil: Online Distributional Learning


Introduction

This package provides online estimation of distributional regression models. The main contribution is an online/incremental implementation of the generalized additive models for location, scale and shape (GAMLSS, see Rigby & Stasinopoulos, 2005) developed in Hirsch, Berrisch & Ziel, 2024.

Please have a look at the documentation or the example notebook.

We're actively working on the package and welcome contributions from the community. Have a look at the Release Notes and the Issue Tracker.

Distributional Regression

The main idea of distributional regression (or regression beyond the mean, multiparameter regression) is that the response variable $Y$ is distributed according to a specified distribution $\mathcal{F}(\theta)$, where $\theta$ is the parameter vector for the distribution. In the Gaussian case, we have $\theta = (\theta_1, \theta_2) = (\mu, \sigma)$. We then specify an individual regression model for all parameters of the distribution of the form

$$g_k(\theta_k) = \eta_k = X_k \beta_k$$

where $g_k(\cdot)$ is a link function, which ensures that the predicted distribution parameters are in a sensible range (we don't want, e.g. negative standard deviations), and $\eta_k$ is the predictor. For the Gaussian case, this would imply that we have two regression equations, one for the mean (location) and one for the standard deviation (scale) parameters. Distributions other than the normal distribution are possible, and we have already implemented them, e.g., Student's $t$-distribution and Johnson's $S_U$ distribution. If you are interested in another distribution, please open an Issue.
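To make this concrete for the Gaussian case: a common choice of link functions (treat the exact defaults of any particular implementation as an assumption) is the identity link for the location and the log link for the scale, giving the two regression equations

$$g_1(\mu) = \mu = X_1 \beta_1, \qquad g_2(\sigma) = \log(\sigma) = X_2 \beta_2,$$

so that $\sigma = \exp(X_2 \beta_2) > 0$ is guaranteed regardless of the estimated coefficients.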

This allows us to specify very flexible models that consider the conditional behaviour of the variable's volatility, skewness and tail behaviour. A simple example from electricity markets is wind forecasts, whose errors are skewed depending on the production level - intuitively, there is a higher risk of lower production when the production level is already high, since output cannot rise much above "full load", and in very high winds the turbines might cut off. Modelling these conditional probabilistic behaviours is the key strength of distributional regression models.

Features

  • 🚀 First native Python implementation of generalized additive models for location, scale and shape (GAMLSS).
  • 🚀 Online-first approach, which allows for incremental updates of the model using model.update(X, y).
  • 🚀 Support for various distributions, including Gaussian, Student's $t$, Johnson's $S_U$, Gamma, Log-normal, Exponential, Beta, Gumbel, Inverse Gaussian and more. Implementing new distributions is straightforward.
  • 🚀 Flexible link functions for each distribution, allowing for custom transformations of the parameters.
  • 🚀 Support for regularization methods like Lasso, Ridge and Elastic Net.
  • 🚀 Fast and efficient implementation using numba for just-in-time compilation.
  • 🚀 Full compatibility with scikit-learn estimators and transformers.

Example

Basic estimation and updating procedure:

```python
import ondil
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Model equation: which features enter each distribution parameter
equation = {
    0: "all",  # Can also use "intercept" or np.ndarray of integers / booleans
    1: "all",
    2: "all",
}

# Create the estimator
online_gamlss_lasso = ondil.estimators.OnlineDistributionalRegression(
    distribution=ondil.StudentT(),
    method="lasso",
    equation=equation,
    fit_intercept=True,
    ic="bic",
)

# Initial fit
online_gamlss_lasso.fit(
    X=X[:-11, :],
    y=y[:-11],
)
print("Coefficients for the first N-11 observations \n")
print(online_gamlss_lasso.beta)

# Update call
online_gamlss_lasso.update(
    X=X[[-11], :],
    y=y[[-11]],
)
print("\nCoefficients after update call \n")
print(online_gamlss_lasso.beta)

# Prediction for the last 10 observations
prediction = online_gamlss_lasso.predict_distribution_parameters(
    X=X[-10:, :],
)

print("\nPredictions for the last 10 observations")
# Location, scale and shape (degrees of freedom)
print(prediction)
```
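The predicted parameter matrix can be turned into full predictive distributions, e.g. quantiles, using `scipy.stats`. A hedged sketch with made-up numbers, assuming the columns are ordered as (location, scale, degrees of freedom) for the Student's $t$ distribution used above:

```python
import numpy as np
from scipy import stats

# Hypothetical predicted parameters for three observations;
# columns assumed to be (location mu, scale sigma, degrees of freedom nu).
prediction = np.array([
    [150.0, 50.0, 10.0],
    [120.0, 40.0, 10.0],
    [180.0, 60.0, 10.0],
])

mu, sigma, nu = prediction[:, 0], prediction[:, 1], prediction[:, 2]

# 5%, 50% and 95% predictive quantiles per observation.
quantiles = np.stack(
    [stats.t.ppf(q, df=nu, loc=mu, scale=sigma) for q in (0.05, 0.5, 0.95)],
    axis=1,
)
print(quantiles)
```

Because the $t$ distribution is symmetric, the 50% column equals the location parameter; the outer columns give a 90% prediction interval per observation.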

Installation & Dependencies

The package is available from PyPI: run pip install ondil and enjoy.

ondil is designed to have minimal dependencies. We rely on python>=3.10, numpy, numba and scipy in reasonably up-to-date versions.

Authors

  • Simon Hirsch, University of Duisburg-Essen & Statkraft
  • Jonathan Berrisch, University of Duisburg-Essen
  • Florian Ziel, University of Duisburg-Essen

I was looking for rolch but I found ondil?

rolch (Regularized Online Learning for Conditional Heteroskedasticity) was the original name of this package. We renamed it to ondil (Online Distributional Learning) to better reflect its purpose and functionality, since conditional heteroskedasticity (i.e. non-constant variance) is just one of the many applications of the distributional regression models that can be estimated with this package.

Contributing

We welcome every contribution from the community. Feel free to open an issue if you find bugs or want to propose changes.

We're still in an early phase and welcome feedback, especially on the usability and "look and feel" of the package. Secondly, we're working to port distributions from the R GAMLSS package and welcome corresponding PRs.

To get started, just create a fork and get going. We will modularize the code over the next versions and increase our testing coverage. We use ruff and black as formatters.

Acknowledgements & Disclosure

Simon is employed at Statkraft and gratefully acknowledges support received from Statkraft for his PhD studies. This work contains the author's opinions and does not necessarily reflect Statkraft's position.

Install from Source

1) Clone this repo.
2) Install the necessary dependencies from requirements.txt using conda create --name <env> --file requirements.txt.
3) Run pip install ., optionally with --force or --force --no-deps, to ensure the package is built from the updated wheels. If you want to be 100% sure no cached wheels are used, or you need the tarball, run python -m build before installing.
4) Enjoy.

Owner

  • Login: simon-hirsch
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this package, please cite the working paper.
authors:
  - family-names: Hirsch
    given-names: Simon
  - family-names: Berrisch
    given-names: Jonathan
  - family-names: Ziel
    given-names: Florian
title: ondil
version: 0.2.2
date-released: '2024-02-27'
preferred-citation:
  authors:
    - family-names: Hirsch
      given-names: Simon
    - family-names: Berrisch
      given-names: Jonathan
    - family-names: Ziel
      given-names: Florian
  title: Online Distributional Regression
  type: article
  year: '2024'
  journal: "arXiv preprint arXiv:2407.08750"

GitHub Events

Total
  • Create event: 16
  • Release event: 3
  • Issues event: 17
  • Watch event: 5
  • Delete event: 10
  • Issue comment event: 46
  • Push event: 107
  • Pull request review event: 80
  • Pull request review comment event: 61
  • Pull request event: 35
Last Year
  • Create event: 16
  • Release event: 3
  • Issues event: 17
  • Watch event: 5
  • Delete event: 10
  • Issue comment event: 46
  • Push event: 107
  • Pull request review event: 80
  • Pull request review comment event: 61
  • Pull request event: 35

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 30 days
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 1.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 30 days
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 1.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • simon-hirsch (10)
  • BerriJ (3)
  • murthy-econometrics-5819 (1)
Pull Request Authors
  • simon-hirsch (10)
  • BerriJ (5)
  • murthy-econometrics-5819 (3)
  • Copilot (2)
  • flziel (1)
Top Labels
Issue Labels
distributions (5) enhancement (3) sklearn-comp (2) documentation (1) good first issue (1) estimators (1)
Pull Request Labels
enhancement (3) testing (2) estimators (1) sklearn-comp (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 27 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
pypi.org: ondil

Methods for online / incremental estimation of distributional regression models

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 27 Last month
Rankings
Dependent packages count: 9.1%
Average: 30.1%
Dependent repos count: 51.1%
Maintainers (1)
Last synced: 4 months ago