slearn

Symbolic sequence learning package

https://github.com/nla-group/slearn

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.9%) to scientific vocabulary

Keywords

machine-learning-algorithms symbolic-sequence
Last synced: 6 months ago

Repository

Symbolic sequence learning package

Basic Info
  • Host: GitHub
  • Owner: nla-group
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 15 MB
Statistics
  • Stars: 12
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 1
Topics
machine-learning-algorithms symbolic-sequence
Created over 4 years ago · Last pushed 8 months ago
Metadata Files
Readme License Code of conduct Citation

README.md

slearn: Python package for learning symbolic sequences


Symbolic representations of time series have demonstrated their effectiveness in tasks such as motif discovery, clustering, classification, forecasting, and anomaly detection. These methods not only reduce the dimensionality of time series data but also accelerate downstream tasks. Elsworth and Güttel [Time Series Forecasting Using LSTM Networks: A Symbolic Approach, arXiv, 2020] have shown that symbolic forecasting significantly reduces the sensitivity of Long Short-Term Memory (LSTM) networks to hyperparameter settings. However, deploying machine learning algorithms at the symbolic level—rather than on raw time series data—remains a challenging problem for many practical applications. To support the research community and streamline the application of machine learning to symbolic representations, we developed the slearn Python library. This library offers APIs for symbolic sequence generation, complexity measurement, and machine learning model training on symbolic data. We will illustrate several core features and use cases below.

Install

Install the slearn package simply by

pip

pip install slearn

conda

conda install -c conda-forge slearn

To check which version you have installed, use: conda list slearn

Usage

Generate strings with customized complexity

A key feature of slearn is its ability to compute distances between symbolic sequences, enabling similarity or dissimilarity measurements after transformation. The library includes the LZWStringLibrary, which supports string distance computation based on Lempel-Ziv-Welch (LZW) complexity.

slearn enables the generation of strings of tunable complexity, using LZW compression as a base to approximate Kolmogorov complexity. It also contains tools for exploring the hyperparameter space of commonly used RNNs as well as novel architectures. The LZW complexity of a string is quantified by the number of unique substrings in its LZW compression dictionary, and the distance method of the LZWStringLibrary class uses this quantity to compute a (typically normalized) similarity score between two strings. This is particularly useful when comparing time series that have been transformed into symbolic sequences using methods like SAX.

```python
from slearn import *

df_strings = lzw_string_library(symbols=3, complexity=[4, 9], random_state=0)
print(df_strings)
```

Output:

```
   nr_symbols  LZW_complexity  length       string
0           3               4       4         ACBB
1           3               9      11  CBACBCABABB
```
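The distance computation itself can be understood from first principles. The following minimal sketch (an independent illustration, not slearn's exact implementation) counts the phrases an LZW parser adds to its dictionary and combines the counts in the style of the normalized compression distance:

```python
def lzw_complexity(s: str) -> int:
    """Number of phrases an LZW parser adds to its dictionary while reading s."""
    dictionary = set(s)  # seed with the individual symbols
    phrase, count = "", 0
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            dictionary.add(phrase + ch)
            count += 1
            phrase = ch
    return count + (1 if phrase else 0)

def lzw_distance(a: str, b: str) -> float:
    """Normalized-compression-distance-style score built on LZW complexity."""
    ca, cb, cab = lzw_complexity(a), lzw_complexity(b), lzw_complexity(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

print(lzw_complexity("ACBB"), lzw_complexity("CBACBCABABB"))  # 4 9, matching the table above
print(lzw_distance("ACBB", "CBACBCABABB"))
```

On the two generated strings above, this phrase count reproduces the LZW_complexity column of the output.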

Benchmarking Transformers and RNNs performance for memorizing capability

slearn offers benchmarking tools for comparing the memorization capability of deep models. Via the benchmark_models interface, it automatically generates analysis documents and visualizations for the tested models. One can either use the built-in models or design custom models following the model examples. An example is shown below.
```python
from slearn.deepmodels import LSTMModel, GRUModel, TransformerModel, GPTLikeModel  # use built-in models or customize your own
from slearn.simulation import benchmark_models

model_list = [LSTMModel, GRUModel, TransformerModel, GPTLikeModel]
benchmark_models(
    model_list,
    symbols_list=[2, 4, 6, 8],               # number of distinct symbols
    complexities=[210, 230, 250, 270, 290],  # higher complexity indicates a tougher task
    sequence_lengths=[3500],
    window_size=100,
    validation_length=100,
    stopping_loss=0.1,
    max_epochs=999,
    num_runs=5,
    units=[128],
    layers=[1, 2, 3],
    batch_size=256,
    max_strings_per_complexity=1000,
    learning_rates=[1e-3, 1e-4],
)
```
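For intuition, here is a standalone sketch of what such a memorization benchmark measures (a conceptual illustration in plain PyTorch, not slearn's code; all names are ours): train a small recurrent model to predict the next symbol of a fixed sequence and count the epochs needed to reach the stopping loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A fixed symbolic sequence the model has to memorize.
sequence = "CBACBCABABB" * 30
vocab = sorted(set(sequence))
codes = torch.tensor([vocab.index(c) for c in sequence])
window = 20
X = torch.stack([codes[i:i + window] for i in range(len(codes) - window)])
y = codes[window:]

class TinyLSTM(nn.Module):
    def __init__(self, vocab_size, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.emb(x))
        return self.head(out[:, -1])  # logits for the symbol after each window

model = TinyLSTM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(999):                 # cf. max_epochs above
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if loss.item() < 0.1:                # cf. stopping_loss above
        print(f"reached target loss after {epoch + 1} epochs")
        break
```

Higher-complexity strings take more epochs to memorize, which is exactly the effect the benchmark sweeps over.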

Symbolic time series representation

The following table summarizes the implemented Symbolic Aggregate Approximation (SAX) variants and the ABBA method for time series representation:

| Algorithm | Time Series Type | Segmentation | Features Extracted | Symbolization | Reconstruction |
|-----------|------------------|--------------|--------------------|---------------|----------------|
| SAX | Univariate | Fixed-size segments | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from PAA values |
| SAX-TD | Univariate | Fixed-size segments | Mean (PAA), slope | Mean to symbol, trend suffix ('u', 'd', 'f') | Linear trends from PAA and slopes |
| eSAX | Univariate | Fixed-size segments | Min, mean, max | Three symbols per segment (min, mean, max) | Quadratic interpolation from min, mean, max |
| mSAX | Multivariate | Fixed-size segments | Mean per dimension | One symbol per dimension per segment | Piecewise constant per dimension |
| aSAX | Univariate | Adaptive segments (based on local variance) | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from adaptive segments |
| ABBA | Univariate | Adaptive piecewise linear segments | Length, increment | Clustering (k-means), symbols assigned to clusters | Piecewise linear from cluster centers |

  • SAX: Standard SAX with fixed-size segments and mean-based symbolization.
  • SAX-TD: Extends SAX with trend information (up, down, flat) per segment.
  • eSAX: Enhanced SAX capturing min, mean, and max per segment for smoother reconstruction.
  • mSAX: Multivariate SAX, processing each dimension independently.
  • aSAX: Adaptive SAX, adjusting segment sizes based on local variance for better representation of variable patterns.
  • ABBA: Adaptive Brownian Bridge-based Aggregation, using piecewise linear segmentation and k-means clustering for symbolization (based on https://github.com/nla-group/fABBA).
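For intuition, the classic SAX pipeline in the table above (fixed-size segments, PAA means, Gaussian breakpoints) can be sketched in a few lines; this is an illustration only, not the slearn implementation:

```python
import numpy as np
from scipy.stats import norm

def sax_sketch(ts, window_size=10, alphabet_size=8):
    """Classic SAX: z-normalize, take PAA means, map them to Gaussian-breakpoint symbols."""
    ts = (ts - ts.mean()) / ts.std()  # z-normalization
    n_segments = len(ts) // window_size
    paa = ts[:n_segments * window_size].reshape(n_segments, window_size).mean(axis=1)
    # Breakpoints split the standard normal into alphabet_size equiprobable regions.
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    return "".join(chr(ord("A") + int(np.searchsorted(breakpoints, x))) for x in paa)

t = np.linspace(0, 10, 100)
print(sax_sketch(np.sin(t)))  # one symbol per fixed-size segment
```

The slearn implementations of these variants can be exercised as follows: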

```python
import numpy as np
from slearn.symbols import *

def test_sax_variant(model, ts, t, name, is_multivariate=False):
    symbols = model.fit_transform(ts)
    recon = model.inverse_transform()
    print(f"{name} reconstructed length: {len(recon)}")
    rmse = np.sqrt(np.mean((ts - recon) ** 2))
    return rmse

# Generate test time series
np.random.seed(42)
t = np.linspace(0, 10, 100)
ts = np.sin(t) + np.random.normal(0, 0.1, 100)  # Univariate, main test
ts_multi = np.vstack([np.sin(t), np.cos(t)]).T + np.random.normal(0, 0.1, (100, 2))  # Multivariate

sax = SAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(sax, ts, t, "SAX")

sax_td = SAXTD(window_size=10, alphabet_size=8)
rmse = test_sax_variant(sax_td, ts, t, "SAX-TD")

esax = ESAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(esax, ts, t, "eSAX")

msax = MSAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(msax, ts_multi, t, "mSAX", is_multivariate=True)

asax = ASAX(n_segments=10, alphabet_size=8)
rmse = test_sax_variant(asax, ts, t, "aSAX")
```

String distance and similarity metrics

slearn implements interfaces for string distance and similarity metrics, together with their normalized variants, each strictly adhering to its formal definition.

```python
from slearn.dmetric import *

print(damerau_levenshtein_distance("cat", "act"))
print(jaro_winkler_distance("martha", "marhta"))

print(normalized_damerau_levenshtein_distance("cat", "act"))
print(normalized_jaro_winkler_distance("martha", "marhta"))
```
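As a cross-check on the definitions, the restricted Damerau-Levenshtein distance and one common normalization (dividing by the longer string length) can be implemented from scratch; this sketch is independent of slearn, and the normalization convention is our assumption:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted (optimal string alignment) Damerau-Levenshtein distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("cat", "act"))  # 1: a single transposition
print(damerau_levenshtein("cat", "act") / max(len("cat"), len("act")))  # normalized: 0.333...
```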

Model support

slearn currently supports the SAX, ABBA, and fABBA symbolic representations, together with the machine learning classifiers listed below:

| Supported Classifiers | Parameter call |
| ---- | ---- |
| Multi-layer Perceptron | 'MLPClassifier' |
| K-Nearest Neighbors | 'KNeighborsClassifier' |
| Gaussian Naive Bayes | 'GaussianNB' |
| Decision Tree | 'DecisionTreeClassifier' |
| Support Vector Classification | 'SVC' |
| Radial-basis Function Kernel | 'RBF' |
| Logistic Regression | 'LogisticRegression' |
| Quadratic Discriminant Analysis | 'QuadraticDiscriminantAnalysis' |
| AdaBoost classifier | 'AdaBoostClassifier' |
| Random Forest | 'RandomForestClassifier' |
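As a conceptual illustration of how such classifiers operate on symbolic data (this sketch calls scikit-learn directly and is not the slearn interface; the window size ws is an arbitrary choice): encode the sequence as sliding windows of preceding symbols and train a next-symbol classifier.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

sequence = "ABCDABCDABCDABCDABCD"  # a toy symbolic sequence
ws = 3                             # window size (hypothetical choice)
codes = np.array([ord(c) - ord("A") for c in sequence])

# Each training sample is a window of ws symbols; the label is the next symbol.
X = np.array([codes[i:i + ws] for i in range(len(codes) - ws)])
y = codes[ws:]

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(X, y)

# Forecast the symbol that follows the final window.
next_code = clf.predict(codes[-ws:].reshape(1, -1))[0]
print(chr(ord("A") + next_code))   # 'A' continues this periodic sequence
```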

Our documentation is available.

Citation

This slearn implementation is maintained by Roberto Cahuantzi (University of Manchester), Xinye Chen (Charles University Prague), and Stefan Güttel (University of Manchester). If you use the LZWStringLibrary functionality in your research, or if you find slearn useful in your work, please consider citing the paper below. If you have any problems or questions, just drop us an email.

```bibtex
@InProceedings{10.1007/978-3-031-37963-5_53,
  author="Cahuantzi, Roberto and Chen, Xinye and G{\"u}ttel, Stefan",
  title="A Comparison of LSTM and GRU Networks for Learning Symbolic Sequences",
  booktitle="Intelligent Computing",
  year="2023",
  publisher="Springer Nature Switzerland",
  pages="771--785"
}
```

License

This project is licensed under the terms of the MIT license.

Contributing

Contributions to this repo are welcome! We will work through all pull requests and try to merge them into the main branch.

TO DO LIST:
* language modeling functionalities
* comprehensive documentation
* performance optimization

Owner

  • Name: nla-group
  • Login: nla-group
  • Kind: organization

Citation (CITATION.cff)

cff-version: 0.0.1
message: "If you use this software, please cite it as below."
authors:
- family-names: "Cahuantzi"
  given-names: "Roberto"
  orcid: "https://orcid.org/0000-0002-0212-6825"
- family-names: "Chen"
  given-names: "Xinye"
  orcid: "https://orcid.org/0000-0003-1778-393X"
- family-names: "G\"{u}ttel"
  given-names: "Stefan"
  orcid: "https://orcid.org/0000-0003-1494-4478"
title: "slearn"
version: 0.2.5
doi: 10.5281/zenodo.1234
date-released: 2021-11-23
url: "https://github.com/nla-group/slearn"

GitHub Events

Total
  • Watch event: 3
  • Delete event: 37
  • Member event: 1
  • Issue comment event: 1
  • Push event: 51
  • Pull request event: 35
  • Create event: 24
Last Year
  • Watch event: 3
  • Delete event: 37
  • Member event: 1
  • Issue comment event: 1
  • Push event: 51
  • Pull request event: 35
  • Create event: 24

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 229
  • Total Committers: 3
  • Avg Commits per committer: 76.333
  • Development Distribution Score (DDS): 0.092
Past Year
  • Commits: 20
  • Committers: 1
  • Avg Commits per committer: 20.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Null 4****e 208
chenxinye c****t@y****m 17
Stefan Güttel g****l 4

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 19
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 16
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 15
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • chenxinye (29)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 243 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 29
  • Total maintainers: 1
pypi.org: slearn

A package linking symbolic representation with sklearn for time series prediction

  • Versions: 27
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 243 Last month
Rankings
Dependent packages count: 10.1%
Downloads: 13.9%
Average: 17.2%
Stargazers count: 17.7%
Dependent repos count: 21.6%
Forks count: 22.6%
Maintainers (1)
Last synced: 7 months ago
conda-forge.org: slearn
  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Average: 49.3%
Dependent packages count: 51.2%
Stargazers count: 54.5%
Forks count: 57.4%
Last synced: 6 months ago

Dependencies

info/requires.txt pypi
  • lightgbm *
  • numpy >=1.7.2
  • pandas *
  • requests *
  • scikit-learn *
requirements.txt pypi
  • lightgbm *
  • numpy *
  • pandas *
  • scikit-learn *
  • scipy *
.github/workflows/Draft-pdf.yml actions
  • actions/checkout v2 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite