frouros

Frouros: an open-source Python library for drift detection in machine learning systems.

https://github.com/ifca-advanced-computing/frouros

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 64 DOI reference(s) in README
✓ Academic publication links: Links to researchgate.net, sciencedirect.com, acm.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.6%) to scientific vocabulary

Keywords

change-detection concept-drift covariate-shift data-drift dataset-drift dataset-shift distribution-shift drift-detection machine-learning machine-learning-engineering machine-learning-operations mle mlops python statistics

Keywords from Contributors

mesh interactive

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: IFCA-Advanced-Computing
License: bsd-3-clause
Language: Python
Default Branch: main
Homepage: https://frouros.readthedocs.io
Size: 22.3 MB

Statistics

Stars: 224
Watchers: 4
Forks: 17
Open Issues: 17
Releases: 22

Created over 4 years ago · Last pushed 11 months ago

Metadata Files

Readme Contributing License Code of conduct Citation Codeowners

README.md

Frouros is a Python library for drift detection in machine learning systems that provides a combination of classical and more recent algorithms for both concept and data drift detection.

"Everything changes and nothing stands still"

"You could not step twice into the same river"

Heraclitus of Ephesus (535-475 BCE)

⚡️ Quickstart

🔄 Concept drift

As a quick example, we can induce concept drift in the breast cancer dataset and use a concept drift detector such as DDM (Drift Detection Method) to detect it. The example also shows how concept drift degrades performance in terms of accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from frouros.detectors.concept_drift import DDM, DDMConfig
from frouros.metrics import PrequentialError

np.random.seed(seed=31)

# Load breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Define and fit model
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipeline.fit(X=X_train, y=y_train)

# Detector configuration and instantiation
config = DDMConfig(
    warning_level=2.0,
    drift_level=3.0,
    min_num_instances=25,  # minimum number of instances before checking for concept drift
)
detector = DDM(config=config)

# Metric to compute accuracy
metric = PrequentialError(alpha=1.0)  # alpha=1.0 is equivalent to normal accuracy


def stream_test(X_test, y_test, y, metric, detector):
    """Simulate data stream over X_test and y_test. y is the true label."""
    drift_flag = False
    for i, (X, y) in enumerate(zip(X_test, y_test)):
        y_pred = pipeline.predict(X.reshape(1, -1))
        error = 1 - (y_pred.item() == y.item())
        metric_error = metric(error_value=error)
        _ = detector.update(value=error)
        status = detector.status
        if status["drift"] and not drift_flag:
            drift_flag = True
            print(f"Concept drift detected at step {i}. Accuracy: {1 - metric_error:.4f}")
    if not drift_flag:
        print("No concept drift detected")
    print(f"Final accuracy: {1 - metric_error:.4f}\n")


# Simulate data stream (assuming test label available after each prediction)
# No concept drift is expected to occur
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> No concept drift detected
# >> Final accuracy: 0.9766

# IMPORTANT: Induce/simulate concept drift in the last part (20%)
# of y_test by modifying some labels (50% approx). This changes P(y|X).
drift_size = int(y_test.shape[0] * 0.2)
y_test_drift = y_test[-drift_size:]
modify_idx = np.random.rand(*y_test_drift.shape) <= 0.5
y_test_drift[modify_idx] = (y_test_drift[modify_idx] + 1) % len(np.unique(y_test))
y_test[-drift_size:] = y_test_drift

# Reset detector and metric
detector.reset()
metric.reset()

# Simulate data stream (assuming test label available after each prediction)
# Concept drift is expected to occur because of the label modification
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> Concept drift detected at step 142. Accuracy: 0.9510
# >> Final accuracy: 0.8480
```
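To make the roles of the warning level, drift level, and minimum number of instances more concrete, the following is a minimal, self-contained sketch of the DDM idea as described by Gama et al. (2004): track the running error rate p and its standard deviation s, remember the lowest p + s seen so far, and signal when the current p + s exceeds that minimum by a multiple of the stored standard deviation. The class `SimpleDDM` and the toy stream are illustrative assumptions, not the Frouros implementation, which may differ in details such as comparison operators and edge-case handling.

```python
import math


class SimpleDDM:
    """Didactic DDM-style detector over a stream of 0/1 errors (hypothetical class)."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_num_instances=25):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_num_instances = min_num_instances
        self.n = 0              # instances seen so far
        self.errors = 0         # misclassifications seen so far
        self.p_min = math.inf   # error rate at the best point seen so far
        self.s_min = math.inf   # standard deviation at that point

    def update(self, error):
        """Consume one error (1 = wrong, 0 = right) and return the current state."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n < self.min_num_instances:
            return "stable"  # too few instances to judge
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s  # new best operating point
        if p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# Deterministic toy stream: 10% errors for 200 steps, then 50% errors.
stream = [int(i % 10 == 0) for i in range(200)] + [int(i % 2 == 0) for i in range(200)]

detector = SimpleDDM()
states = [detector.update(error) for error in stream]
drift_step = next((i for i, state in enumerate(states) if state == "drift"), None)
print(f"First drift signal at step {drift_step}")
```

In this sketch the detector stays stable while the error rate is constant, passes through the warning state shortly after the error rate jumps, and then signals drift. A production detector would normally also be reset after a drift signal.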

More concept drift examples can be found here.
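The quickstart above notes that `alpha=1.0` makes the prequential error equivalent to normal accuracy. A short sketch can show why, assuming the usual fading-factor definition from the data stream literature, where the error is the ratio of a decayed loss sum S_i = loss_i + alpha * S_{i-1} to a decayed count N_i = 1 + alpha * N_{i-1}. The function `prequential_errors` is a hypothetical helper written for illustration; the Frouros metric may be implemented differently.

```python
def prequential_errors(losses, alpha=1.0):
    """Yield the fading-factor prequential error after each loss value."""
    weighted_sum = 0.0    # S_i = loss_i + alpha * S_{i-1}
    weighted_count = 0.0  # N_i = 1 + alpha * N_{i-1}
    for loss in losses:
        weighted_sum = loss + alpha * weighted_sum
        weighted_count = 1.0 + alpha * weighted_count
        yield weighted_sum / weighted_count


# Toy 0/1 losses: the model starts well and then degrades.
losses = [0, 0, 1, 0, 1, 1, 1, 1]

plain = list(prequential_errors(losses, alpha=1.0))  # no forgetting: running mean
faded = list(prequential_errors(losses, alpha=0.5))  # recent losses weigh more

print(f"alpha=1.0 -> final error {plain[-1]:.4f}")
print(f"alpha=0.5 -> final error {faded[-1]:.4f}")
```

With `alpha=1.0` nothing is forgotten, so the value is the plain running mean of the losses (one minus accuracy for 0/1 losses). With a smaller `alpha`, older losses decay and the metric reacts faster to a recent degradation.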

📊 Data drift

As a quick example, we can induce data drift in the iris dataset and use a data drift detector such as the Kolmogorov-Smirnov test to detect it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from frouros.detectors.data_drift import KSTest

np.random.seed(seed=31)

# Load iris dataset
X, y = load_iris(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Set the feature index to which detector is applied
feature_idx = 0

# IMPORTANT: Induce/simulate data drift in the selected feature of X_test by
# applying some Gaussian noise. This changes P(X).
X_test[:, feature_idx] += np.random.normal(
    loc=0.0,
    scale=3.0,
    size=X_test.shape[0],
)

# Define and fit model
model = DecisionTreeClassifier(random_state=31)
model.fit(X=X_train, y=y_train)

# Set significance level for hypothesis testing
alpha = 0.001

# Define and fit detector
detector = KSTest()
_ = detector.fit(X=X_train[:, feature_idx])

# Apply detector to the selected feature of X_test
result, _ = detector.compare(X=X_test[:, feature_idx])

# Check if drift is taking place
if result.p_value <= alpha:
    print(f"Data drift detected at feature {feature_idx}")
else:
    print(f"No data drift detected at feature {feature_idx}")
# >> Data drift detected at feature 0
# Therefore, we can reject H0 (both samples come from the same distribution).
```

More data drift examples can be found here.
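For readers who want to see what the test in the example above measures, here is a minimal standard-library sketch of the two-sample Kolmogorov-Smirnov test: the statistic is the largest gap between the two empirical CDFs, and the p-value below uses the asymptotic Kolmogorov distribution. The helper `ks_two_sample` and the toy samples are assumptions made for illustration; this is not how Frouros computes its result, and exact small-sample p-values from a statistics library can differ from this approximation.

```python
import bisect
import math


def ks_two_sample(reference, test):
    """Return (statistic, asymptotic p-value) for the two-sample KS test."""
    ref, new = sorted(reference), sorted(test)
    n, m = len(ref), len(new)
    # The largest ECDF gap is always attained at one of the observed values.
    statistic = max(
        abs(bisect.bisect_right(ref, v) / n - bisect.bisect_right(new, v) / m)
        for v in ref + new
    )
    lam = math.sqrt(n * m / (n + m)) * statistic
    if lam < 0.2:
        # The alternating series converges slowly here; the true value is ~1.
        return statistic, 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return statistic, min(max(2.0 * series, 0.0), 1.0)


alpha = 0.001  # same significance level as in the example above

reference = list(range(100))            # toy reference sample
similar = [v + 0.5 for v in reference]  # almost the same distribution
shifted = [v + 50 for v in reference]   # clearly shifted distribution

stat_similar, p_similar = ks_two_sample(reference, similar)
stat_shifted, p_shifted = ks_two_sample(reference, shifted)

for name, p_value in (("similar", p_similar), ("shifted", p_shifted)):
    verdict = "drift detected" if p_value <= alpha else "no drift detected"
    print(f"{name}: p-value={p_value:.3g} -> {verdict}")
```

Only the shifted sample yields a p-value below `alpha`, so only there would the null hypothesis of a common distribution be rejected.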

🛠 Installation

Frouros can be installed via pip:

```bash
pip install frouros
```

🕵🏻‍♂️️ Drift detection methods

The currently implemented detectors are listed in the following table.

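| Drift detector | Type | Family | Univariate (U) / Multivariate (M) | Numerical (N) / Categorical (C) | Method | Reference |
|---|---|---|---|---|---|---|
| Concept drift | Streaming | Change detection | U | N | BOCD | Adams and MacKay (2007) |
| Concept drift | Streaming | Change detection | U | N | CUSUM | Page (1954) |
| Concept drift | Streaming | Change detection | U | N | Geometric moving average | Roberts (1959) |
| Concept drift | Streaming | Change detection | U | N | Page Hinkley | Page (1954) |
| Concept drift | Streaming | Statistical process control | U | N | DDM | Gama et al. (2004) |
| Concept drift | Streaming | Statistical process control | U | N | ECDD-WT | Ross et al. (2012) |
| Concept drift | Streaming | Statistical process control | U | N | EDDM | Baena-García et al. (2006) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-A | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-W | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | RDDM | Barros et al. (2017) |
| Concept drift | Streaming | Window based | U | N | ADWIN | Bifet and Gavalda (2007) |
| Concept drift | Streaming | Window based | U | N | KSWIN | Raab et al. (2020) |
| Concept drift | Streaming | Window based | U | N | STEPD | Nishida and Yamauchi (2007) |
| Data drift | Batch | Distance based | U | N | Bhattacharyya distance | Bhattacharyya (1946) |
| Data drift | Batch | Distance based | U | N | Earth Mover's distance | Rubner et al. (2000) |
| Data drift | Batch | Distance based | U | N | Energy distance | Székely et al. (2013) |
| Data drift | Batch | Distance based | U | N | Hellinger distance | Hellinger (1909) |
| Data drift | Batch | Distance based | U | N | Histogram intersection normalized complement | Swain and Ballard (1991) |
| Data drift | Batch | Distance based | U | N | Jensen-Shannon distance | Lin (1991) |
| Data drift | Batch | Distance based | U | N | Kullback-Leibler divergence | Kullback and Leibler (1951) |
| Data drift | Batch | Distance based | M | N | Maximum Mean Discrepancy | Gretton et al. (2012) |
| Data drift | Batch | Distance based | U | N | Population Stability Index | Wu and Olson (2010) |
| Data drift | Batch | Statistical test | U | N | Anderson-Darling test | Scholz and Stephens (1987) |
| Data drift | Batch | Statistical test | U | N | Baumgartner-Weiss-Schindler test | Baumgartner et al. (1998) |
| Data drift | Batch | Statistical test | U | C | Chi-square test | Pearson (1900) |
| Data drift | Batch | Statistical test | U | N | Cramér-von Mises test | Cramér (1928) |
| Data drift | Batch | Statistical test | U | N | Kolmogorov-Smirnov test | Massey Jr (1951) |
| Data drift | Batch | Statistical test | U | N | Kuiper's test | Kuiper (1960) |
| Data drift | Batch | Statistical test | U | N | Mann-Whitney U test | Mann and Whitney (1947) |
| Data drift | Batch | Statistical test | U | N | Welch's t-test | Welch (1947) |
| Data drift | Streaming | Distance based | M | N | Maximum Mean Discrepancy | Gretton et al. (2012) |
| Data drift | Streaming | Statistical test | U | N | Incremental Kolmogorov-Smirnov test | dos Reis et al. (2016) |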
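To illustrate how the distance-based family differs from the statistical tests, the sketch below computes one of the listed measures, the Population Stability Index, with the standard library only. It assumes the common definition PSI = Σ (a_i - e_i) · ln(a_i / e_i) over equal-width bins derived from the reference sample; the function name, the binning scheme, and the `eps` floor for empty bins are illustrative choices, not the Frouros implementation. A distance-based detector returns a distance rather than a p-value, so the decision threshold has to be chosen by the user.

```python
import math


def population_stability_index(expected, actual, num_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a new sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = ((hi - lo) / num_bins) or 1.0  # guard against a constant reference

    def proportions(values):
        counts = [0] * num_bins
        for value in values:
            idx = int((value - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(idx, 0), num_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(count / len(values), eps) for count in counts]

    exp_prop, act_prop = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(act_prop, exp_prop))


reference = list(range(100))            # toy reference sample
shifted = [v + 30 for v in reference]   # same shape, shifted location

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI (same sample):    {psi_same:.4f}")
print(f"PSI (shifted sample): {psi_shifted:.4f}")
```

Identical samples give a PSI of zero, while the shifted sample gives a clearly positive value because several reference bins become empty and the last bin absorbs the out-of-range values.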

❗ What is and what is not Frouros?

Unlike other libraries that provide drift detection algorithms alongside other functionality, such as anomaly/outlier detection, adversarial detection, or imbalance learning, Frouros has and will ONLY have one purpose: drift detection.

We firmly believe that machine learning libraries and frameworks should not follow the "jack of all trades, master of none" principle. Instead, they should focus on a single task and do it well.

✅ Who is using Frouros?

Frouros is actively being used by the following projects to implement drift detection in machine learning pipelines:

AI4EOSC.
iMagine.

If you want your project listed here, do not hesitate to send us a pull request.

👍 Contributing

Check out the contribution section.

💬 Citation

If you want to cite Frouros you can use the SoftwareX publication.

```bibtex
@article{CESPEDESSISNIEGA2024101733,
  title = {Frouros: An open-source Python library for drift detection in machine learning systems},
  journal = {SoftwareX},
  volume = {26},
  pages = {101733},
  year = {2024},
  issn = {2352-7110},
  doi = {https://doi.org/10.1016/j.softx.2024.101733},
  url = {https://www.sciencedirect.com/science/article/pii/S2352711024001043},
  author = {Jaime {Céspedes Sisniega} and Álvaro {López García}},
  keywords = {Machine learning, Drift detection, Concept drift, Data drift, Python},
  abstract = {Frouros is an open-source Python library capable of detecting drift in machine learning systems. It provides a combination of classical and more recent algorithms for drift detection, covering both concept and data drift. We have designed it to be compatible with any machine learning framework and easily adaptable to real-world use cases. The library is developed following best development and continuous integration practices to ensure ease of maintenance and extensibility.}
}
```

📝 License

Frouros is open-source software licensed under the BSD-3-Clause license.

🙏 Acknowledgements

Frouros has received funding from the Agencia Estatal de Investigación, Unidad de Excelencia María de Maeztu, ref. MDM-2017-0765.

Owner

Name: IFCA Advanced Computing and e-Science group
Login: IFCA-Advanced-Computing
Kind: organization
Location: Santander, Spain

Website: http://computing.ifca.es/
Twitter: IFCA_Computing
Repositories: 56
Profile: https://github.com/IFCA-Advanced-Computing

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  Frouros: An open-source Python library for drift detection
  in machine learning systems
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jaime
    family-names: Céspedes Sisniega
    email: cespedes@ifca.unican.es
    orcid: 'https://orcid.org/0000-0002-6010-1212'
    affiliation: >-
      Institute of Physics of Cantabria, Spanish National
      Research Council — IFCA (CSIC—UC)
  - given-names: Álvaro
    family-names: López García
    email: aloga@ifca.unican.es
    orcid: 'https://orcid.org/0000-0002-0013-4602'
    affiliation: >-
      Institute of Physics of Cantabria, Spanish National
      Research Council — IFCA (CSIC—UC)
identifiers:
  - type: doi
    value: 10.1016/j.softx.2024.101733
    description: SoftwareX
  - type: doi
    value: 10.48550/arXiv.2208.06868
    description: arXiv
repository-code: 'https://github.com/IFCA-Advanced-Computing/frouros'
url: 'https://frouros.readthedocs.io'
repository: 'https://github.com/ElsevierSoftwareX/SOFTX-D-24-00119'
repository-artifact: 'https://pypi.org/project/frouros'
abstract: >-
  Frouros is an open-source Python library capable of detecting drift in machine learning systems. It provides a combination of classical and more recent algorithms for drift detection, covering both concept and data drift. We have designed it to be compatible with any machine learning framework and easily adaptable to real-world use cases. The library is developed following best development and continuous integration practices to ensure ease of maintenance and extensibility.
keywords:
  - Machine learning
  - Drift detection
  - Concept drift
  - Data drift
  - Python
license: BSD-3-Clause
commit: 4e1e27ee73507b15090f0038d8dda7c67485b728
version: 0.8.0
date-released: '2024-04-03'

GitHub Events

Total

Issues event: 1
Watch event: 37
Delete event: 11
Issue comment event: 9
Push event: 19
Pull request event: 33
Pull request review event: 8
Fork event: 3
Create event: 20

Last Year

Issues event: 1
Watch event: 37
Delete event: 11
Issue comment event: 9
Push event: 19
Pull request event: 33
Pull request review event: 8
Fork event: 3
Create event: 20

Committers

Last synced: over 2 years ago

All Time

Total Commits: 646
Total Committers: 5
Avg Commits per committer: 129.2
Development Distribution Score (DDS): 0.054

Past Year

Commits: 388
Committers: 4
Avg Commits per committer: 97.0
Development Distribution Score (DDS): 0.082

Top Committers

Name	Email	Commits
Jaime Céspedes Sisniega	j**a@g**m	611
dependabot[bot]	4****]	23
Jaime Céspedes Sisniega	c**s@i**s	6
Alvaro Lopez Garcia	a**a@i**s	4
Jaime Céspedes Sisniega	7****a	2
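The all-time figures above can be reproduced from this table. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits, which matches the listed value; committer identities are kept separate per e-mail address, exactly as the table lists them.

```python
# Commit counts copied from the "Top Committers" table above.
top_committers = [
    ("Jaime Céspedes Sisniega", 611),
    ("dependabot[bot]", 23),
    ("Jaime Céspedes Sisniega", 6),
    ("Alvaro Lopez Garcia", 4),
    ("Jaime Céspedes Sisniega", 2),
]

total_commits = sum(count for _, count in top_committers)
avg_commits = total_commits / len(top_committers)

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
top_share = max(count for _, count in top_committers) / total_commits
dds = 1 - top_share

print(total_commits, round(avg_commits, 1), round(dds, 3))
# prints: 646 129.2 0.054
```

A score this close to zero indicates that almost all commits come from a single committer identity.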

Committer Domains (Top 20 + Academic)

ifca.unican.es: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 12
Total pull requests: 174
Average time to close issues: 6 months
Average time to close pull requests: 12 days
Total issue authors: 2
Total pull request authors: 3
Average comments per issue: 0.75
Average comments per pull request: 0.09
Merged pull requests: 159
Bot issues: 0
Bot pull requests: 61

Past Year

Issues: 0
Pull requests: 31
Average time to close issues: N/A
Average time to close pull requests: 3 months
Issue authors: 0
Pull request authors: 3
Average comments per issue: 0
Average comments per pull request: 0.39
Merged pull requests: 17
Bot issues: 0
Bot pull requests: 24

Top Authors

Issue Authors

jaime-cespedes-sisniega (11)
Tiffany-TW (1)

Pull Request Authors

jaime-cespedes-sisniega (136)
dependabot[bot] (92)
MarcBresson (1)

Top Labels

Issue Labels

bug (7) enhancement (3) management (2) dataset (1) utils (1) needs triage (1)

Pull Request Labels

dependencies (93) bug (58) management (37) enhancement (32) documentation (18) notebook (14) ci (13) callbacks (6) detector (5) test (3) utils (2) ci/cd (2) dataset (1)

Packages

Total packages: 1
Total downloads:
- pypi 4,531 last-month

Total dependent packages: 0
Total dependent repositories: 2
Total versions: 22
Total maintainers: 2

pypi.org: frouros

An open-source Python library for drift detection in machine learning systems

Homepage: https://github.com/IFCA-Advanced-Computing/frouros
Documentation: https://frouros.readthedocs.io/
License: BSD-3-Clause
Latest release: 0.9.0
published almost 2 years ago

Versions: 22
Dependent Packages: 0
Dependent Repositories: 2
Downloads: 4,531 Last month

Rankings

Stargazers count: 6.6%

Dependent packages count: 10.1%

Average: 10.7%

Forks count: 11.4%

Dependent repos count: 11.6%

Downloads: 14.0%

Maintainers (2)

aloga jaime-cespedes-sisniega

Last synced: 11 months ago

Dependencies

.github/workflows/ci.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/code_coverage.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
codecov/codecov-action v3 composite

.github/workflows/documentation.yml actions

readthedocs/actions/preview v1 composite

.github/workflows/publish.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
pypa/gh-action-pypi-publish release/v1 composite

pyproject.toml pypi

matplotlib >=3.6.0,<3.8
numpy >=1.24.0,<1.26
requests >=2.31.0,<2.32
scipy >=1.10.0,<1.11
tqdm >=4.65,<5

setup.py pypi