stream-learn

stream-learn is an open-source Python library for difficult data stream analysis.

https://github.com/w4k2/stream-learn

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    3 of 13 committers (23.1%) from academic institutions
  • Institutional organization owner
    Organization w4k2 has institutional domain (kssk.pwr.edu.pl)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.8%) to scientific vocabulary

Keywords

data-streams machine-learning python software
Last synced: 6 months ago

Repository

stream-learn is an open-source Python library for difficult data stream analysis.

Basic Info
Statistics
  • Stars: 63
  • Watchers: 8
  • Forks: 20
  • Open Issues: 4
  • Releases: 1
Topics
data-streams machine-learning python software
Created about 8 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License

README.md

stream-learn


The stream-learn module is a set of tools necessary for processing data streams using scikit-learn estimators. The batch processing approach is used here, where the dataset is passed to the classifier in smaller, consecutive subsets called chunks. The module consists of five sub-modules:

  • streams - containing a data stream generator that allows obtaining both stationary and dynamic distributions in accordance with various types of concept drift (also in the field of a priori probability, i.e. dynamically imbalanced data) and a parser for the standard ARFF file format,
  • evaluators - containing classes for running experiments on stream data in accordance with the Test-Then-Train and Prequential methodologies,
  • classifiers - containing sample stream classifiers,
  • ensembles - containing standard hybrid models for stream data classification,
  • metrics - containing typical classification quality metrics for data streams.

You can read more about each module on the documentation page.
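As a quick orientation, the sub-modules correspond directly to import paths of the strlearn package. A minimal sketch, limited to names that appear in this guide or in the list above:

```python
from strlearn.streams import StreamGenerator                # synthetic stream generation
from strlearn.evaluators import TestThenTrain, Prequential  # experiment protocols
from strlearn.metrics import precision                      # stream-oriented metrics
```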

Citation policy

If you use stream-learn in a scientific publication, we would appreciate a citation of the following paper:

```
@article{Ksieniewicz2022,
  doi = {10.1016/j.neucom.2021.10.120},
  url = {https://doi.org/10.1016/j.neucom.2021.10.120},
  year = {2022},
  month = jan,
  publisher = {Elsevier {BV}},
  author = {P. Ksieniewicz and P. Zyblewski},
  title = {stream-learn {\textemdash} open-source Python library for difficult data stream batch analysis},
  journal = {Neurocomputing}
}
```

Quick start guide

Installation

To use the stream-learn package, you will first need to install it. Fortunately, it is available in the PyPI repository, so you may install it using pip:

```shell
pip3 install -U stream-learn
```

stream-learn is also available with conda:

```shell
conda install stream-learn -c w4k2 -c conda-forge
```

You can also install the module from a clone of the GitHub repository using the setup.py file, if you have a strange but perhaps legitimate need:

```shell
git clone https://github.com/w4k2/stream-learn.git
cd stream-learn
make install
```

Preparing experiments

1. Classifier

In order to conduct experiments, a declaration of four elements is necessary. The first is the estimator, which must be compatible with the scikit-learn API and, in addition, implement the partial_fit() method, which allows re-fitting an already built model. For example, we'll use the standard Gaussian Naive Bayes algorithm:

```python
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
```
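Any scikit-learn estimator that supports incremental learning will work here. As an illustrative sanity check (not part of the stream-learn API), you can verify the requirement directly:

```python
# The evaluators refit the model chunk by chunk, so the estimator must
# expose partial_fit in addition to the usual fit/predict methods.
assert hasattr(clf, "partial_fit"), "estimator must support incremental learning"
```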

2. Data Stream

The next element is the data stream that we aim to process. In the example, we will use a synthetic stream consisting of a shocking number of 100 chunks and containing precisely one concept drift. We will prepare it using the StreamGenerator() class of the stream-learn module:

```python
from strlearn.streams import StreamGenerator

stream = StreamGenerator(n_chunks=100, n_drifts=1)
```
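The evaluator below will consume the stream for us, but chunks can also be pulled manually with the stream's get_chunk() method. A minimal sketch (the exact shapes depend on the generator's chunk_size and n_features settings; note that manual reads consume chunks, so you would not mix them with an evaluator run on the same stream object):

```python
# Each call returns one chunk as an (X, y) pair of numpy arrays.
X, y = stream.get_chunk()
print(X.shape, y.shape)
```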

3. Metrics

The third requirement of the experiment is to specify the metrics used in the evaluation of the methods. In the example, we will use the accuracy metric available in scikit-learn and the precision from the stream-learn module:

```python
from sklearn.metrics import accuracy_score
from strlearn.metrics import precision

metrics = [accuracy_score, precision]
```
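Both entries are plain callables with the scikit-learn signature metric(y_true, y_pred), so a hand-rolled metric can be used in the same way. The error_rate function below is a hypothetical example, not part of either library:

```python
# Hypothetical custom metric with the same (y_true, y_pred) signature;
# it could be appended to the metrics list alongside the two above.
def error_rate(y_true, y_pred):
    return 1 - accuracy_score(y_true, y_pred)
```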

4. Evaluator

The last necessary element of processing is the evaluator, i.e. the method of conducting the experiment. For example, we will choose the Test-Then-Train paradigm, described in more detail in the User Guide. It is important to note that we need to provide the metrics we will use at the point of initializing the evaluator. If no metrics are given, it will use the default pair of accuracy and balanced accuracy scores:

```python
from strlearn.evaluators import TestThenTrain

evaluator = TestThenTrain(metrics)
```
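Equivalently, and only as an illustration of the default behaviour described above, the evaluator can be initialized without any arguments:

```python
# With no metrics given, the default pair of accuracy and
# balanced accuracy scores is used.
default_evaluator = TestThenTrain()
```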

Processing and understanding results

Once all processing requirements have been met, we can proceed with the evaluation. To start processing, call the evaluator's process method, feeding it with the stream and the classifier:

```python
evaluator.process(stream, clf)
```

The results obtained are stored in the scores attribute of the evaluator. If we print it on the screen, we may observe that it is a three-dimensional numpy array with dimensions (1, 99, 2), since the first of the 100 chunks is used only for initial training and each of the remaining 99 is evaluated.

  • The first dimension is the index of a classifier submitted for processing. In the example above, we used only one model, but it is also possible to pass a tuple or list of classifiers that will be processed in parallel (See User Guide).
  • The second dimension specifies the instance of evaluation, which in the case of Test-Then-Train methodology directly means the index of the processed chunk.
  • The third dimension indicates the metric used in the processing.
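With these axis semantics in mind, a quick way to summarize a run is to average over the chunk dimension; a minimal numpy sketch:

```python
# Mean of each metric over all evaluated chunks;
# the result has shape (n_classifiers, n_metrics), here (1, 2).
mean_scores = evaluator.scores.mean(axis=1)
print(mean_scores)
```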

Using this knowledge, we may finally try to illustrate the results of our simple experiment in the form of a plot:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 3))

for m, metric in enumerate(metrics):
    plt.plot(evaluator.scores[0, :, m], label=metric.__name__)

plt.title("Basic example of stream processing")
plt.ylim(0, 1)
plt.ylabel('Quality')
plt.xlabel('Chunk')

plt.legend()
```

Owner

  • Name: Katedra Systemów i Sieci Komputerowych
  • Login: w4k2
  • Kind: organization
  • Location: Wrocław, Poland

Department of Systems and Computer Networks

GitHub Events

Total
  • Watch event: 2
  • Delete event: 3
  • Issue comment event: 3
  • Push event: 16
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 5
  • Create event: 3
Last Year
  • Watch event: 2
  • Delete event: 3
  • Issue comment event: 3
  • Push event: 16
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 5
  • Create event: 3

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 594
  • Total Committers: 13
  • Avg Commits per committer: 45.692
  • Development Distribution Score (DDS): 0.554
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Pawel Ksieniewicz | p****l@k****m | 265 |
| Paweł Ksieniewicz | p****z@p****l | 117 |
| TibetanSandFox | p****i@p****l | 99 |
| Paweł | x****s@m****l | 57 |
| jedrzejkozal | j****l@g****m | 25 |
| chkoar | i****r@g****m | 9 |
| JakubKlik | a****9@g****m | 6 |
| JakubKlik | j****i@p****l | 5 |
| Bogdan Gulowaty | b****y@g****m | 4 |
| Joana | a****4@v****l | 3 |
| JKomorniczak | 4****k@u****m | 2 |
| Bogdan Gulowaty | b****y@u****m | 1 |
| xehivs | x****s@p****n | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 22
  • Total pull requests: 20
  • Average time to close issues: about 1 year
  • Average time to close pull requests: 10 days
  • Total issue authors: 9
  • Total pull request authors: 6
  • Average comments per issue: 0.32
  • Average comments per pull request: 0.45
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 6
  • Average time to close issues: N/A
  • Average time to close pull requests: 5 days
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.83
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • xehivs (11)
  • bgulowaty (3)
  • chkoar (2)
  • ZahirBilal (1)
  • yousefabdi (1)
  • TibetanSandFox (1)
  • francoispichard (1)
  • MaxHalford (1)
Pull Request Authors
  • jedrzejkozal (9)
  • JakubKlik (4)
  • JKomorniczak (2)
  • bgulowaty (2)
  • chkoar (2)
  • xehivs (1)
Top Labels
Issue Labels
enhancement (2) · bug (1)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi: 42 last month
  • Total dependent packages: 1
    (may contain duplicates)
  • Total dependent repositories: 3
    (may contain duplicates)
  • Total versions: 42
  • Total maintainers: 5
pypi.org: stream-learn

The stream-learn module is a set of tools necessary for processing data streams using scikit-learn estimators.

  • Versions: 41
  • Dependent Packages: 1
  • Dependent Repositories: 2
  • Downloads: 35 Last month
Rankings
Forks count: 8.4%
Stargazers count: 8.8%
Dependent packages count: 10.1%
Average: 10.9%
Dependent repos count: 11.6%
Downloads: 15.8%
Last synced: 6 months ago
pypi.org: hspectral

Toolbox for hyperspectral data.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 7 Last month
Rankings
Forks count: 8.2%
Stargazers count: 9.0%
Dependent packages count: 10.1%
Dependent repos count: 21.6%
Average: 26.2%
Downloads: 82.2%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/codeql-analysis.yml actions
  • actions/checkout v2 composite
  • github/codeql-action/analyze v1 composite
  • github/codeql-action/autobuild v1 composite
  • github/codeql-action/init v1 composite
requirements.txt pypi
  • matplotlib *
  • numpy *
  • numpydoc *
  • pillow *
  • problexity >=0.5.1
  • requests *
  • scikit-learn *
  • scipy *
  • setuptools *
  • sphinx *
  • sphinx_gallery *
  • sphinx_rtd_theme *
  • sphinxcontrib-bibtex *
  • tqdm *