MLxtend
MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack - Published in JOSS (2018)
PyCM
PyCM: Multiclass confusion matrix library in Python - Published in JOSS (2018)
Learning from Crowds with Crowd-Kit
Learning from Crowds with Crowd-Kit - Published in JOSS (2024)
PyClustering
PyClustering: Data Mining Library - Published in JOSS (2019)
NiaARM
NiaARM: A minimalistic framework for Numerical Association Rule Mining - Published in JOSS (2022)
WordTokenizers.jl
WordTokenizers.jl: Basic tools for tokenizing natural language in Julia - Published in JOSS (2020)
HiPart
HiPart: Hierarchical Divisive Clustering Toolbox - Published in JOSS (2023)
scikit-hubness
scikit-hubness: Hubness Reduction and Approximate Neighbor Search - Published in JOSS (2020)
latentcor
latentcor: An R Package for estimating latent correlations from mixed data types - Published in JOSS (2021)
pypots
A Python toolkit/library for reality-centric machine/deep learning and data mining on partially-observed time series, including SOTA neural network models for scientific analysis tasks of imputation/classification/clustering/forecasting/anomaly detection/cleaning on incomplete industrial (irregularly-sampled) multivariate TS with NaN missing values
lexicalrichness
A module to compute textual lexical richness (aka lexical diversity).
lightgbm
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://github.com/alan-turing-institute/clevercsv
CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
pyprobables
Probabilistic data structures in Python http://pyprobables.readthedocs.io/en/latest/index.html
PySPOD
PySPOD: A Python package for Spectral Proper Orthogonal Decomposition (SPOD) - Published in JOSS (2021)
sport-activities-features
A minimalistic toolbox for extracting features from sports activity files written in Python
smart-agriculture-datasets
https://github.com/chaoss/grimoirelab
GrimoireLab: platform for software development analytics and insights
scibasic
sciBASIC# is a dialect derived from the native VB.NET language, written for data scientists.
lasio
Python library for reading and writing well data using Log ASCII Standard (LAS) files
https://github.com/chaoss/grimoirelab-perceval
Send Sir Perceval on a quest to retrieve and gather data from software repositories.
https://github.com/yzhao062/pyod
A Python Library for Outlier and Anomaly Detection, Integrating Classical and Deep Learning Techniques
CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database
CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database - Published in JOSS (2016)
genieclust
Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection - in Python and R
https://github.com/firefly-cpp/uarmsolver
Universal Association Rule Mining Solver
automlpipeline.jl
A package that makes it trivial to create and evaluate machine learning pipeline architectures.
@stdlib/datasets-fivethirtyeight-ffq
FiveThirtyEight reader responses to a food frequency questionnaire (FFQ).
https://github.com/ermshaua/claspy
ClaSPy: A Python package for time series segmentation.
https://github.com/business-science/timetk
Time series analysis in the `tidyverse`
https://github.com/matrix-profile-foundation/matrixprofile
A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.
https://github.com/harrymvr/absorbing-centrality
An implementation of the absorbing random-walk centrality
https://github.com/centrefordigitalhumanities/exploring-culture-through-data
Course materials for Summer School by Data School
https://github.com/alibaba/alink
Alink is the Machine Learning algorithm platform based on Flink, developed by the PAI team of Alibaba computing platform.
https://github.com/baggepinnen/matrixprofile.jl
Time-series analysis using the Matrix profile in Julia
https://github.com/erictleung/ml-portfolio
Experiment with various machine learning algorithms on various data sets from the University of California, Irvine (UCI) Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html)
pygrinder
PyGrinder: a Python toolkit for grinding data beans into the incomplete for real-world data simulation by introducing missing values with different missingness patterns, including MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random), subsequence missing, and block missing
arctic3d
Automatic Retrieval and ClusTering of Interfaces in Complexes from 3D structural information
TSrepr R package
TSrepr R package: Time Series Representations - Published in JOSS (2018)
https://github.com/cvjena/libmaxdiv
Implementation of the Maximally Divergent Intervals algorithm for Anomaly Detection in multivariate spatio-temporal time-series.
awesome-arm-in-smart-agriculture
A collection of literature on the use of association rule mining methods in smart agriculture
https://github.com/emptymalei/mini-lab
Some code snippets used to explain stuff to myself in my personal data science wiki
https://github.com/desbordante/desbordante-core
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also supports running data-cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.
https://github.com/grrvlr/tsmd
The TSMD project brings together Motif Discovery methods for Time Series, aiming to compare their performance through well-defined research questions and to simplify their practical use. It provides both guidelines for selecting the most suitable methods based on the data, and accessible implementations of the most relevant approaches.
tsb-uad
An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection
core-periphery-hypergraphs
[KDD 2022] Official Code Release for "Core-periphery Models for Hypergraphs"
https://github.com/cn-tu/py-outlier-detection-stream-data
Outlier Detection in Stream Data with Python. CN contact: Félix Iglesias
https://github.com/avallecam/cdcper
Miscellaneous functions customized for analysis tasks at CDC Perú
awesome-production-machine-learning
A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning
https://github.com/equinor/timeseriesanalysis
Library that combines control engineering, dynamic simulation and machine learning on time-series. Developed to describe industrial processes and automation. Lightweight, robust and fast for use in advanced analytics. Built on .NET to run anywhere.
https://github.com/centrefordigitalhumanities/gabber
A project for the Data School.
https://github.com/amr-yasser226/data-mining-and-information-retrieval
Revision notes and MCQs for DSAI 201 – Data Mining and Information Retrieval. Includes lecture summaries, algorithm overviews, and practice questions to support course preparation and review.
nuggets
R package for searching for patterns in subspaces described with elementary conjunctions
https://github.com/robelgium/msnoise
A Python Package for Monitoring Seismic Velocity Changes using Ambient Seismic Noise | http://www.msnoise.org
tsdb
A Python toolbox that loads 172 public time series datasets for machine/deep learning with a single line of code. The datasets come from multiple domains, including healthcare, finance, power, traffic, and weather.