slearn

Symbolic sequence learning package

https://github.com/nla-group/slearn

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.9%) to scientific vocabulary

Keywords

machine-learning-algorithms symbolic-sequence
Last synced: 6 months ago

Repository

Symbolic sequence learning package

Basic Info
  • Host: GitHub
  • Owner: nla-group
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 15 MB
Statistics
  • Stars: 12
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 1
Topics
machine-learning-algorithms symbolic-sequence
Created over 4 years ago · Last pushed 8 months ago
Metadata Files
Readme License Code of conduct Citation

README.md

slearn: Python package for learning symbolic sequences


Symbolic representations of time series have demonstrated their effectiveness in tasks such as motif discovery, clustering, classification, forecasting, and anomaly detection. These methods not only reduce the dimensionality of time series data but also accelerate downstream tasks. Elsworth and Güttel [Time Series Forecasting Using LSTM Networks: A Symbolic Approach, arXiv, 2020] have shown that symbolic forecasting significantly reduces the sensitivity of Long Short-Term Memory (LSTM) networks to hyperparameter settings. However, deploying machine learning algorithms at the symbolic level—rather than on raw time series data—remains a challenging problem for many practical applications. To support the research community and streamline the application of machine learning to symbolic representations, we developed the slearn Python library. This library offers APIs for symbolic sequence generation, complexity measurement, and machine learning model training on symbolic data. We will illustrate several core features and use cases below.

Install

Install the slearn package simply by

pip

pip install slearn

conda

conda install -c conda-forge slearn

To check which version you have installed, use: conda list slearn

Usage

Generate strings with customized complexity

A key feature of slearn is its ability to compute distances between symbolic sequences, enabling similarity or dissimilarity measurements after transformation. The library includes the LZWStringLibrary, which supports string distance computation based on Lempel-Ziv-Welch (LZW) complexity.

slearn enables the generation of strings of tunable complexity, using LZW compression as a base to approximate Kolmogorov complexity. It also contains tools for exploring the hyperparameter space of commonly used RNNs as well as novel architectures. The LZW complexity of a string is quantified by the number of unique substrings in its LZW compression dictionary, and the distance method of the LZWStringLibrary class uses this quantity to compute a (typically normalized) similarity score between two strings. This is particularly useful when comparing time series that have been transformed into symbolic sequences using methods like SAX.

```python
from slearn import *

df_strings = lzw_string_library(symbols=3, complexity=[4, 9], random_state=0)
print(df_strings)
```

Output:

```
   nr_symbols  LZW_complexity  length       string
0           3               4       4         ACBB
1           3               9      11  CBACBCABABB
```
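The distance computation itself can be understood from first principles. The following minimal sketch (an independent illustration, not slearn's exact implementation) counts the phrases an LZW parser adds to its dictionary and combines the counts in the style of the normalized compression distance:

```python
def lzw_complexity(s: str) -> int:
    """Number of phrases an LZW parser adds to its dictionary while reading s."""
    dictionary = set(s)  # seed with the individual symbols
    phrase, count = "", 0
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            dictionary.add(phrase + ch)
            count += 1
            phrase = ch
    return count + (1 if phrase else 0)

def lzw_distance(a: str, b: str) -> float:
    """Normalized-compression-distance-style score built on LZW complexity."""
    ca, cb, cab = lzw_complexity(a), lzw_complexity(b), lzw_complexity(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

print(lzw_complexity("ACBB"), lzw_complexity("CBACBCABABB"))  # 4 9, matching the table above
print(lzw_distance("ACBB", "CBACBCABABB"))
```

On the two generated strings above, this phrase count reproduces the LZW_complexity column of the output.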

Benchmarking Transformers and RNNs performance for memorizing capability

slearn offers benchmarking tools for comparing the memorization capability of deep models. Via the benchmark_models interface, it automatically generates analysis documents and visualizations for the tested models. One can either use the built-in models or design custom models following the model examples. An example is shown below.
```python
from slearn.deepmodels import LSTMModel, GRUModel, TransformerModel, GPTLikeModel  # use built-in models or customize your own
from slearn.simulation import benchmark_models

model_list = [LSTMModel, GRUModel, TransformerModel, GPTLikeModel]
benchmark_models(
    model_list,
    symbols_list=[2, 4, 6, 8],               # number of distinct symbols
    complexities=[210, 230, 250, 270, 290],  # higher complexity indicates a tougher task
    sequence_lengths=[3500],
    window_size=100,
    validation_length=100,
    stopping_loss=0.1,
    max_epochs=999,
    num_runs=5,
    units=[128],
    layers=[1, 2, 3],
    batch_size=256,
    max_strings_per_complexity=1000,
    learning_rates=[1e-3, 1e-4],
)
```
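For intuition, here is a standalone sketch of what such a memorization benchmark measures (a conceptual illustration in plain PyTorch, not slearn's code; all names are ours): train a small recurrent model to predict the next symbol of a fixed sequence and count the epochs needed to reach the stopping loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A fixed symbolic sequence the model has to memorize.
sequence = "CBACBCABABB" * 30
vocab = sorted(set(sequence))
codes = torch.tensor([vocab.index(c) for c in sequence])
window = 20
X = torch.stack([codes[i:i + window] for i in range(len(codes) - window)])
y = codes[window:]

class TinyLSTM(nn.Module):
    def __init__(self, vocab_size, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.emb(x))
        return self.head(out[:, -1])  # logits for the symbol after each window

model = TinyLSTM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(999):                 # cf. max_epochs above
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if loss.item() < 0.1:                # cf. stopping_loss above
        print(f"reached target loss after {epoch + 1} epochs")
        break
```

Higher-complexity strings take more epochs to memorize, which is exactly the effect the benchmark sweeps over.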

Symbolic time series representation

The following table summarizes the implemented Symbolic Aggregate Approximation (SAX) variants and the ABBA method for time series representation:

| Algorithm | Time Series Type | Segmentation | Features Extracted | Symbolization | Reconstruction |
|-----------|------------------|--------------|--------------------|---------------|----------------|
| SAX | Univariate | Fixed-size segments | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from PAA values |
| SAX-TD | Univariate | Fixed-size segments | Mean (PAA), slope | Mean to symbol, trend suffix ('u', 'd', 'f') | Linear trends from PAA and slopes |
| eSAX | Univariate | Fixed-size segments | Min, mean, max | Three symbols per segment (min, mean, max) | Quadratic interpolation from min, mean, max |
| mSAX | Multivariate | Fixed-size segments | Mean per dimension | One symbol per dimension per segment | Piecewise constant per dimension |
| aSAX | Univariate | Adaptive segments (based on local variance) | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from adaptive segments |
| ABBA | Univariate | Adaptive piecewise linear segments | Length, increment | Clustering (k-means), symbols assigned to clusters | Piecewise linear from cluster centers |

  • SAX: Standard SAX with fixed-size segments and mean-based symbolization.
  • SAX-TD: Extends SAX with trend information (up, down, flat) per segment.
  • eSAX: Enhanced SAX capturing min, mean, and max per segment for smoother reconstruction.
  • mSAX: Multivariate SAX, processing each dimension independently.
  • aSAX: Adaptive SAX, adjusting segment sizes based on local variance for better representation of variable patterns.
  • ABBA: Adaptive Brownian Bridge-based Aggregation, using piecewise linear segmentation and k-means clustering for symbolization (based on https://github.com/nla-group/fABBA).
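For intuition, the classic SAX pipeline in the table above (fixed-size segments, PAA means, Gaussian breakpoints) can be sketched in a few lines; this is an illustration only, not the slearn implementation:

```python
import numpy as np
from scipy.stats import norm

def sax_sketch(ts, window_size=10, alphabet_size=8):
    """Classic SAX: z-normalize, take PAA means, map them to Gaussian-breakpoint symbols."""
    ts = (ts - ts.mean()) / ts.std()  # z-normalization
    n_segments = len(ts) // window_size
    paa = ts[:n_segments * window_size].reshape(n_segments, window_size).mean(axis=1)
    # Breakpoints split the standard normal into alphabet_size equiprobable regions.
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    return "".join(chr(ord("A") + int(np.searchsorted(breakpoints, x))) for x in paa)

t = np.linspace(0, 10, 100)
print(sax_sketch(np.sin(t)))  # one symbol per fixed-size segment
```

The slearn implementations of these variants can be exercised as follows: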

```python
import numpy as np
from slearn.symbols import *

def test_sax_variant(model, ts, t, name, is_multivariate=False):
    symbols = model.fit_transform(ts)
    recon = model.inverse_transform()
    print(f"{name} reconstructed length: {len(recon)}")
    rmse = np.sqrt(np.mean((ts - recon) ** 2))
    return rmse

# Generate test time series
np.random.seed(42)
t = np.linspace(0, 10, 100)
ts = np.sin(t) + np.random.normal(0, 0.1, 100)  # Univariate, main test
ts_multi = np.vstack([np.sin(t), np.cos(t)]).T + np.random.normal(0, 0.1, (100, 2))  # Multivariate

sax = SAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(sax, ts, t, "SAX")

sax_td = SAXTD(window_size=10, alphabet_size=8)
rmse = test_sax_variant(sax_td, ts, t, "SAX-TD")

esax = ESAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(esax, ts, t, "eSAX")

msax = MSAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(msax, ts_multi, t, "mSAX", is_multivariate=True)

asax = ASAX(n_segments=10, alphabet_size=8)
rmse = test_sax_variant(asax, ts, t, "aSAX")
```

String distance and similarity metrics

slearn implements interfaces for string distance and similarity metrics, together with their normalized variants, each strictly adhering to its formal definition.

```python
from slearn.dmetric import *

print(damerau_levenshtein_distance("cat", "act"))
print(jaro_winkler_distance("martha", "marhta"))

print(normalized_damerau_levenshtein_distance("cat", "act"))
print(normalized_jaro_winkler_distance("martha", "marhta"))
```
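As a cross-check on the definitions, the restricted Damerau-Levenshtein distance and one common normalization (dividing by the longer string length) can be implemented from scratch; this sketch is independent of slearn, and the normalization convention is our assumption:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted (optimal string alignment) Damerau-Levenshtein distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("cat", "act"))  # 1: a single transposition
print(damerau_levenshtein("cat", "act") / max(len("cat"), len("act")))  # normalized: 0.333...
```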

Model support

slearn currently supports the SAX, ABBA, and fABBA symbolic representations, together with the machine learning classifiers listed below:

| Supported Classifiers | Parameter call |
| ---- | ---- |
| Multi-layer Perceptron | 'MLPClassifier' |
| K-Nearest Neighbors | 'KNeighborsClassifier' |
| Gaussian Naive Bayes | 'GaussianNB' |
| Decision Tree | 'DecisionTreeClassifier' |
| Support Vector Classification | 'SVC' |
| Radial-basis Function Kernel | 'RBF' |
| Logistic Regression | 'LogisticRegression' |
| Quadratic Discriminant Analysis | 'QuadraticDiscriminantAnalysis' |
| AdaBoost classifier | 'AdaBoostClassifier' |
| Random Forest | 'RandomForestClassifier' |
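As a conceptual illustration of how such classifiers operate on symbolic data (this sketch calls scikit-learn directly and is not the slearn interface; the window size ws is an arbitrary choice): encode the sequence as sliding windows of preceding symbols and train a next-symbol classifier.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

sequence = "ABCDABCDABCDABCDABCD"  # a toy symbolic sequence
ws = 3                             # window size (hypothetical choice)
codes = np.array([ord(c) - ord("A") for c in sequence])

# Each training sample is a window of ws symbols; the label is the next symbol.
X = np.array([codes[i:i + ws] for i in range(len(codes) - ws)])
y = codes[ws:]

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(X, y)

# Forecast the symbol that follows the final window.
next_code = clf.predict(codes[-ws:].reshape(1, -1))[0]
print(chr(ord("A") + next_code))   # 'A' continues this periodic sequence
```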

Our documentation is available.

Citation

This slearn implementation is maintained by Roberto Cahuantzi (University of Manchester), Xinye Chen (Charles University Prague), and Stefan Güttel (University of Manchester). If you use the LZWStringLibrary functionality in your research, or if you find slearn useful in your work, please consider citing the paper below. If you have any problems or questions, just drop us an email.

```bibtex
@InProceedings{10.1007/978-3-031-37963-5_53,
  author="Cahuantzi, Roberto and Chen, Xinye and G{\"u}ttel, Stefan",
  title="A Comparison of LSTM and GRU Networks for Learning Symbolic Sequences",
  booktitle="Intelligent Computing",
  year="2023",
  publisher="Springer Nature Switzerland",
  pages="771--785"
}
```

License

This project is licensed under the terms of the MIT license.

Contributing

Contributions to this repo are welcome! We will work through all pull requests and try to merge them into the main branch.

TO DO LIST:
* language modeling functionalities
* comprehensive documentation
* performance optimization

Owner

  • Name: nla-group
  • Login: nla-group
  • Kind: organization

Citation (CITATION.cff)

cff-version: 0.0.1
message: "If you use this software, please cite it as below."
authors:
- family-names: "Cahuantzi"
  given-names: "Roberto"
  orcid: "https://orcid.org/0000-0002-0212-6825"
- family-names: "Chen"
  given-names: "Xinye"
  orcid: "https://orcid.org/0000-0003-1778-393X"
- family-names: "G\"{u}ttel"
  given-names: "Stefan"
  orcid: "https://orcid.org/0000-0003-1494-4478"
title: "slearn"
version: 0.2.5
doi: 10.5281/zenodo.1234
date-released: 2021-11-23
url: "https://github.com/nla-group/slearn"

GitHub Events

Total
  • Watch event: 3
  • Delete event: 37
  • Member event: 1
  • Issue comment event: 1
  • Push event: 51
  • Pull request event: 35
  • Create event: 24
Last Year
  • Watch event: 3
  • Delete event: 37
  • Member event: 1
  • Issue comment event: 1
  • Push event: 51
  • Pull request event: 35
  • Create event: 24

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 229
  • Total Committers: 3
  • Avg Commits per committer: 76.333
  • Development Distribution Score (DDS): 0.092
Past Year
  • Commits: 20
  • Committers: 1
  • Avg Commits per committer: 20.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Null 4****e 208
chenxinye c****t@y****m 17
Stefan Güttel g****l 4

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 19
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 16
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 15
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • chenxinye (29)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 243 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 29
  • Total maintainers: 1
pypi.org: slearn

A package linking symbolic representation with sklearn for time series prediction

  • Versions: 27
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 243 Last month
Rankings
Dependent packages count: 10.1%
Downloads: 13.9%
Average: 17.2%
Stargazers count: 17.7%
Dependent repos count: 21.6%
Forks count: 22.6%
Maintainers (1)
Last synced: 7 months ago
conda-forge.org: slearn
  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Average: 49.3%
Dependent packages count: 51.2%
Stargazers count: 54.5%
Forks count: 57.4%
Last synced: 6 months ago

Dependencies

info/requires.txt pypi
  • lightgbm *
  • numpy >=1.7.2
  • pandas *
  • requests *
  • scikit-learn *
requirements.txt pypi
  • lightgbm *
  • numpy *
  • pandas *
  • scikit-learn *
  • scipy *
.github/workflows/Draft-pdf.yml actions
  • actions/checkout v2 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite