https://github.com/eonu/sequentia

Scikit-Learn compatible HMM and DTW based sequence machine learning algorithms in Python.

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.2%) to scientific vocabulary

Keywords

classification-algorithms dtw dynamic-time-warping hidden-markov-models hmm k-nearest-neighbor-classifier knn machine-learning multivariate multivariate-timeseries python sequence-classification sequential-patterns time-series time-series-classification variable-length
Last synced: 6 months ago

Repository

Scikit-Learn compatible HMM and DTW based sequence machine learning algorithms in Python.

Basic Info
Statistics
  • Stars: 66
  • Watchers: 3
  • Forks: 9
  • Open Issues: 1
  • Releases: 28
Topics
classification-algorithms dtw dynamic-time-warping hidden-markov-models hmm k-nearest-neighbor-classifier knn machine-learning multivariate multivariate-timeseries python sequence-classification sequential-patterns time-series time-series-classification variable-length
Created about 6 years ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog Contributing License Code of conduct Codeowners

README.md


Sequentia

Scikit-Learn compatible HMM and DTW based sequence machine learning algorithms in Python.

PyPI · PyPI - Python Version · Read The Docs - Documentation · Coveralls - Coverage · PyPI - License

About · Build Status · Features · Installation · Documentation · Examples · Acknowledgments · References · Contributors · Licensing

About

Sequentia is a Python package that provides various classification and regression algorithms for sequential data, including methods based on hidden Markov models and dynamic time warping.

Some examples of how Sequentia can be used on sequence data include:

  • determining a spoken word based on its audio signal or alternative representations such as MFCCs,
  • predicting motion intent for gesture control from sEMG signals,
  • classifying hand-written characters according to their pen-tip trajectories.

Why Sequentia?

  • Simplicity and interpretability: Sequentia offers a limited set of machine learning algorithms, chosen specifically to be more interpretable and easier to configure than more complex alternatives such as recurrent neural networks and transformers, while maintaining a high level of effectiveness.
  • Familiar and user-friendly: To fit more seamlessly into the workflow of data science practitioners, Sequentia follows the ubiquitous Scikit-Learn API, providing a familiar model development process for many, as well as enabling wider access to the rapidly growing Scikit-Learn ecosystem.
  • Speed: Some algorithms offered by Sequentia naturally have restrictive runtime scaling, such as k-nearest neighbors. However, our implementation is optimized to the point of being multiple orders of magnitude faster than similar packages — see the Benchmarks section for more information.

Build Status

| master | dev |
|:------:|:---:|
| CircleCI Build (Master) | CircleCI Build (Development) |

Features

Models

Dynamic Time Warping + k-Nearest Neighbors (via dtaidistance)

Dynamic Time Warping (DTW) is a distance measure that can be applied to two sequences of different lengths. When used as the distance measure for the k-Nearest Neighbors (kNN) algorithm, this results in a simple yet effective inference method; a minimal illustration using the dtaidistance backend follows the feature list below.

  • [x] Classification
  • [x] Regression
  • [x] Variable length sequences
  • [x] Multivariate real-valued observations
  • [x] Sakoe–Chiba band global warping constraint
  • [x] Dependent and independent feature warping (DTWD/DTWI)
  • [x] Custom distance-weighted predictions
  • [x] Multi-processed prediction
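
For intuition, the distance measure itself can be computed directly with dtaidistance (the backend named above). This is a minimal sketch, independent of Sequentia's estimator API; the `window` argument applies a Sakoe–Chiba band constraint.

```python
import numpy as np
from dtaidistance import dtw

# Two univariate sequences of different lengths.
s1 = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
s2 = np.array([0.0, 2.0, 1.0, 0.0])

# Unconstrained DTW distance.
d = dtw.distance(s1, s2)

# DTW distance constrained to a Sakoe-Chiba band of width 2.
d_banded = dtw.distance(s1, s2, window=2)
```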

Hidden Markov Models (via hmmlearn)

A Hidden Markov Model (HMM) is a state-based statistical model which represents a sequence as a series of observations that are emitted from a collection of latent hidden states which form an underlying Markov chain. Each hidden state has an emission distribution that models its observations.

Expectation-maximization via the Baum-Welch algorithm (also known as the forward-backward algorithm) [1] is used to derive a maximum likelihood estimate of the Markov chain probabilities and emission distribution parameters based on the provided training sequence data; a sketch of this fitting process follows the feature list below.

  • [x] Classification
  • [x] Variable length sequences
  • [x] Multivariate real-valued observations (modeled with Gaussian mixture emissions)
  • [x] Univariate categorical observations (modeled with discrete emissions)
  • [x] Linear, left-right and ergodic topologies
  • [x] Multi-processed training and prediction
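
For intuition, below is a rough sketch of what the hmmlearn backend does during fitting. This is not Sequentia's own API (which is shown in the Examples section): a Gaussian HMM is trained with Baum-Welch on concatenated sequences, and in a classifier one such model is fitted per class, with the highest-scoring model winning.

```python
import numpy as np
from hmmlearn import hmm

# Concatenated training sequences (lengths 4 and 3) with two features.
X = np.array([
    [0.1, 1.0], [0.3, 0.9], [0.2, 1.1], [0.4, 1.0],  # sequence 1
    [1.2, 0.1], [1.1, 0.2], [1.3, 0.0],              # sequence 2
])
lengths = [4, 3]

# Fit a 2-state Gaussian HMM via Baum-Welch (EM).
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
model.fit(X, lengths=lengths)

# Log-likelihood of a sequence under the fitted model.
log_likelihood = model.score(X[:4])
```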

Scikit-Learn compatibility

Sequentia (≥2.0) is compatible with the Scikit-Learn API (≥1.4), enabling rapid development and prototyping of sequential models.

The integration relies on metadata routing, which means that in most cases, the only necessary change is to add a lengths keyword argument to provide sequence length information, e.g. fit(X, y, lengths=lengths) instead of fit(X, y).
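
As a minimal sketch of the difference (using the KNNClassifier covered in full in the Examples section):

```python
import numpy as np
from sequentia.models import KNNClassifier

# Two concatenated univariate sequences of lengths 3 and 2.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
lengths = np.array([3, 2])
y = np.array([0, 1])

clf = KNNClassifier(k=1)

# Plain Scikit-Learn usage would be clf.fit(X, y); Sequentia additionally
# routes the sequence lengths through as a keyword argument.
clf.fit(X, y, lengths=lengths)
y_pred = clf.predict(X, lengths=lengths)
```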

Similar libraries

As DTW k-nearest neighbors is the core algorithm offered by Sequentia, the table below compares its feature support against similar libraries.

| | sequentia | aeon | tslearn | sktime | pyts |
|-|:-:|:-:|:-:|:-:|:-:|
| Scikit-Learn compatible | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multivariate sequences | ✅ | ✅ | ✅ | ✅ | ❌ |
| Variable length sequences | ✅ | ✅ | ➖¹ | ❌² | ❌³ |
| No padding required | ✅ | ❌ | ➖¹ | ❌² | ❌³ |
| Classification | ✅ | ✅ | ✅ | ✅ | ✅ |
| Regression | ✅ | ✅ | ✅ | ✅ | ❌ |
| Preprocessing | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multiprocessing | ✅ | ✅ | ✅ | ✅ | ✅ |
| Custom weighting | ✅ | ✅ | ✅ | ✅ | ✅ |
| Sakoe–Chiba band constraint | ✅ | ✅ | ✅ | ✅ | ✅ |
| Itakura parallelogram constraint | ❌ | ✅ | ✅ | ✅ | ✅ |
| Dependent DTW (DTWD) | ✅ | ✅ | ✅ | ✅ | ❌ |
| Independent DTW (DTWI) | ✅ | ❌ | ❌ | ❌ | ✅ |
| Custom DTW measures | ❌⁴ | ✅ | ❌ | ✅ | ✅ |

  • ¹ tslearn supports variable length sequences with padding, but doesn't seem to mask the padding.
  • ² sktime does not support variable length sequences, so they are padded (and padding is not masked).
  • ³ pyts does not support variable length sequences, so they are padded (and padding is not masked).
  • ⁴ sequentia only supports dtaidistance, which is one of the fastest DTW libraries as it is written in C.

Benchmarks

To compare the above libraries in runtime performance on dynamic time warping k-nearest neighbors classification tasks, a simple benchmark was performed on a univariate sequence dataset.

The Free Spoken Digit Dataset was used for benchmarking and consists of:

  • 3000 recordings of 10 spoken digits (0-9)
    • 50 recordings of each digit for each of 6 speakers
    • 1500 used for training, 1500 used for testing (split via label stratification)
  • 13 features (MFCCs)
    • Only the first feature was used, as not all of the above libraries support multivariate sequences
  • Sequence length statistics: min 6, median 17, max 92

Each result measures the total time taken to complete training and prediction repeated 10 times.

All of the above libraries support multiprocessing, and prediction was performed using 16 workers.

*: sktime, tslearn and pyts seem not to mask padding, which may result in incorrect predictions.
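
The benchmark harness itself is not included in this README. As a rough sketch of the timing methodology, assuming Sequentia's KNNClassifier API (the n_jobs parameter is an assumption based on the multiprocessing support described above):

```python
import time

from sequentia.models import KNNClassifier

def benchmark(X_train, y_train, lengths_train, X_test, lengths_test, repeats=10):
    """Total wall-clock time for `repeats` rounds of training + prediction."""
    start = time.perf_counter()
    for _ in range(repeats):
        clf = KNNClassifier(k=1, n_jobs=16)  # 16 prediction workers
        clf.fit(X_train, y_train, lengths=lengths_train)
        clf.predict(X_test, lengths=lengths_test)
    return time.perf_counter() - start
```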

Device information:
  • Product: Lenovo ThinkPad T14s (Gen 6)
  • Processor: AMD Ryzen™ AI 7 PRO 360 (8 cores, 16 threads, 2–5 GHz)
  • Memory: 64 GB LPDDR5X-7500MHz
  • Solid state drive: 1 TB SSD M.2 2280 PCIe Gen4 Performance TLC Opal
  • Operating system: Fedora Linux 41 (Workstation Edition)

Installation

The latest stable version of Sequentia can be installed with the following command:

```console
pip install sequentia
```

C libraries

For optimal performance when using any of the k-NN based models, it is important that the correct dtaidistance C libraries are accessible.

Please see the dtaidistance installation guide for troubleshooting if you run into C compilation issues, or if using k-NN based models with use_c=True results in a warning.

You can use the following to check if the appropriate C libraries are available.

```python
from dtaidistance import dtw

dtw.try_import_c()
```

If these libraries are unavailable, Sequentia will fall back to using a Python alternative.
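
When the C libraries are available, the k-NN based models can be explicitly asked to use them via the use_c flag mentioned above. A minimal sketch:

```python
from sequentia.models import KNNClassifier

# Request the fast dtaidistance C implementation; if it is unavailable,
# a warning is emitted and the pure-Python fallback is used instead.
clf = KNNClassifier(k=1, use_c=True)
```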

Development

Please see the contribution guidelines to see installation instructions for contributing to Sequentia.

Documentation

Documentation for the package is available on Read The Docs.

Examples

Demonstration of classifying multivariate sequences into two classes using the KNNClassifier.

This example also shows a typical preprocessing workflow, as well as compatibility with Scikit-Learn for pipelining and hyper-parameter optimization.


First, we create some sample multivariate input data consisting of three sequences with two features.

  • Sequentia expects sequences to be concatenated and represented as a single NumPy array.
  • Sequence lengths are provided separately and used to decode the sequences when needed.

This avoids the need for complex structures such as lists of nested arrays with different lengths, or a 3D array with wasteful and annoying padding.

```python
import numpy as np

# Sequence data
X = np.array([
    # Sequence 1 - Length 3
    [1.2 , 7.91],
    [1.34, 6.6 ],
    [0.92, 8.08],
    # Sequence 2 - Length 5
    [2.11, 6.97],
    [1.83, 7.06],
    [1.54, 5.98],
    [0.86, 6.37],
    [1.21, 5.8 ],
    # Sequence 3 - Length 2
    [1.7 , 6.22],
    [2.01, 5.49],
])

# Sequence lengths
lengths = np.array([3, 5, 2])

# Sequence classes
y = np.array([0, 1, 1])
```

With this data, we can train a KNNClassifier and use it for prediction and scoring.

Note: Each of the fit(), predict() and score() methods requires the sequence lengths to be provided in addition to the sequence data X and labels y.

```python
from sequentia.models import KNNClassifier

# Initialize and fit the classifier
clf = KNNClassifier(k=1)
clf.fit(X, y, lengths=lengths)

# Make predictions based on the provided sequences
y_pred = clf.predict(X, lengths=lengths)

# Make predictions based on the provided sequences and calculate accuracy
acc = clf.score(X, y, lengths=lengths)
```

Alternatively, we can use sklearn.pipeline.Pipeline to build a more complex preprocessing pipeline:

  1. Individually denoise each sequence by applying a median filter.
  2. Individually standardize each sequence by subtracting the mean and dividing by the standard deviation of each feature.
  3. Reduce the dimensionality of the data to a single feature using PCA.
  4. Pass the resulting transformed data into a KNNClassifier.

Note: Steps 1 and 2 use IndependentFunctionTransformer, provided by Sequentia, to apply the specified transformation to each sequence in X individually, rather than FunctionTransformer from Scikit-Learn, which would transform the entire X array once, treating it as a single sequence.

```python
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

from sequentia.preprocessing import IndependentFunctionTransformer, median_filter

# Create a preprocessing pipeline that feeds into a KNNClassifier
pipeline = Pipeline([
    ('denoise', IndependentFunctionTransformer(median_filter)),
    ('scale', IndependentFunctionTransformer(scale)),
    ('pca', PCA(n_components=1)),
    ('knn', KNNClassifier(k=1)),
])

# Fit the pipeline to the data
pipeline.fit(X, y, lengths=lengths)

# Predict classes for the sequences
y_pred = pipeline.predict(X, lengths=lengths)

# Make predictions based on the provided sequences and calculate accuracy
acc = pipeline.score(X, y, lengths=lengths)
```

For hyper-parameter optimization, Sequentia provides a sequentia.model_selection sub-package that includes most of the hyper-parameter search and cross-validation methods provided by sklearn.model_selection, but adapted to work with sequences.

For instance, we can perform a grid search with k-fold cross-validation stratifying over labels in order to find an optimal value for the number of neighbors in KNNClassifier for the above pipeline.

```python
from sequentia.model_selection import StratifiedKFold, GridSearchCV

# Define the hyper-parameter search and specify the cross-validation method
search = GridSearchCV(
    # Re-use the above pipeline
    estimator=Pipeline([
        ('denoise', IndependentFunctionTransformer(median_filter)),
        ('scale', IndependentFunctionTransformer(scale)),
        ('pca', PCA(n_components=1)),
        ('knn', KNNClassifier(k=1)),
    ]),
    # Try a range of values of k
    param_grid={"knn__k": [1, 2, 3, 4, 5]},
    # Specify k-fold cross-validation with label stratification using 4 splits
    cv=StratifiedKFold(n_splits=4),
)

# Perform cross-validation over accuracy and retrieve the best model
search.fit(X, y, lengths=lengths)
clf = search.best_estimator_

# Make predictions using the best model and calculate accuracy
acc = clf.score(X, y, lengths=lengths)
```

Acknowledgments

In earlier versions of the package, the approximate DTW implementation fastdtw was used in hopes of speeding up k-NN predictions, as the authors of the original FastDTW paper [2] claim that approximate DTW alignments can be computed in linear time and memory, compared to the O(N²) runtime complexity of the usual exact DTW implementation.

I was contacted by Prof. Eamonn Keogh, whose work makes the surprising revelation that FastDTW is generally slower than the exact DTW algorithm that it approximates [3]. Upon switching from the fastdtw package to dtaidistance (a very solid implementation of exact DTW with fast pure C compiled functions), DTW k-NN prediction times were indeed reduced drastically.

I would like to thank Prof. Eamonn Keogh for directly reaching out to me regarding this finding.

References

[1] Lawrence R. Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77.2 (1989), 257–286.
[2] Stan Salvador & Philip Chan. "FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space." Intelligent Data Analysis 11.5 (2007), 561–580.
[3] Renjie Wu & Eamonn J. Keogh. "FastDTW is Approximate and Generally Slower Than the Algorithm It Approximates." IEEE Transactions on Knowledge and Data Engineering (2020), 1–1.

Contributors

All contributions to this repository are greatly appreciated. Contribution guidelines can be found here.

  • eonu
  • Prhmma
  • manisci
  • jonnor

Licensing

Sequentia is released under the MIT license.

Certain parts of source code are heavily adapted from Scikit-Learn. Such files contain a copy of their license.


Sequentia © 2019, Edwin Onuonga - Released under the MIT license.
Authored and maintained by Edwin Onuonga.

Owner

  • Name: Edwin Onuonga
  • Login: eonu
  • Kind: user
  • Location: Edinburgh, United Kingdom
  • Company: @hazy

Learning to make machines learn.

GitHub Events

Total
  • Create event: 14
  • Release event: 2
  • Issues event: 4
  • Watch event: 3
  • Delete event: 13
  • Issue comment event: 16
  • Push event: 46
  • Pull request event: 27
  • Fork event: 1
Last Year
  • Create event: 14
  • Release event: 2
  • Issues event: 4
  • Watch event: 3
  • Delete event: 13
  • Issue comment event: 16
  • Push event: 46
  • Pull request event: 27
  • Fork event: 1

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 214
  • Total Committers: 3
  • Avg Commits per committer: 71.333
  • Development Distribution Score (DDS): 0.019
Past Year
  • Commits: 7
  • Committers: 2
  • Avg Commits per committer: 3.5
  • Development Distribution Score (DDS): 0.429
Top Committers
Name Email Commits
Edwin Onuonga e****a@g****m 210
eonu e****u 3
Prhmma p****a@g****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 37
  • Total pull requests: 96
  • Average time to close issues: 3 months
  • Average time to close pull requests: about 10 hours
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 1.05
  • Average comments per pull request: 0.11
  • Merged pull requests: 94
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 2
  • Pull requests: 7
  • Average time to close issues: 17 days
  • Average time to close pull requests: about 1 hour
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.5
  • Average comments per pull request: 1.0
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • eonu (34)
  • Vaibhav-2022 (1)
  • manisci (1)
  • franz101 (1)
Pull Request Authors
  • eonu (106)
  • github-actions[bot] (8)
Top Labels
Issue Labels
enhancement (15), priority: 2 (5), priority: 1 (5), bug (4), priority: 3 (3), question (3), priority: 5 (2), priority: 4 (2), documentation (1), help wanted (1), invalid (1), good first issue (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 118 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 31
  • Total maintainers: 1
pypi.org: sequentia

Scikit-Learn compatible HMM and DTW based sequence machine learning algorithms in Python.

  • Versions: 31
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 118 Last month
Rankings
Stargazers count: 9.2%
Dependent packages count: 9.8%
Forks count: 13.3%
Average: 17.2%
Dependent repos count: 21.8%
Downloads: 31.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/create-github-release.yml actions
  • WyriHaximus/github-action-get-previous-tag v1 composite
  • actions/checkout v4 composite
  • ncipollo/release-action v1 composite
.github/workflows/create-pypi-release.yml actions
  • JRubics/poetry-publish v1.17 composite
  • actions/checkout v4 composite
.github/workflows/create-release-pr.yml actions
  • abatilo/actions-poetry v2 composite
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
  • orhun/git-cliff-action v2 composite
  • peter-evans/create-pull-request v5.0.2 composite
  • rickstaa/action-create-tag v1 composite
.github/workflows/semantic-pull-request.yml actions
  • amannn/action-semantic-pull-request v5 composite
pyproject.toml pypi
  • invoke 2.2.0 base
  • tox 4.11.3 base
  • pre-commit >=3 develop
  • enum-tools >=0.11,<1 docs
  • sphinx ^7.2.4 docs
  • sphinx-autobuild ^2021.3.14 docs
  • pydoclint 0.3.8 lint
  • ruff 0.1.3 lint
  • dtaidistance ^2.3.10
  • hmmlearn >=0.2.8,<1
  • joblib ^1.2
  • numba >=0.56,<1
  • numpy ^1.19.5
  • pydantic ^2
  • python ^3.11
  • scikit-learn ^1.4
  • scipy ^1.6
  • pytest ^7.4.0 tests
  • pytest-cov ^4.1.0 tests