Feature-engine
Feature-engine: A Python package for feature engineering for machine learning - Published in JOSS (2021)
tpot
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
geometricus
A structure-based, alignment-free embedding approach for proteins. Can be used as input to machine learning algorithms.
https://github.com/epistasislab/tpot2
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
mljar-supervised
Python package for AutoML on Tabular Data with Feature Engineering, Hyperparameter Tuning, Explanations and Automatic Documentation
https://github.com/featureform/featureform
The Virtual Feature Store. Turn your existing data infrastructure into a feature store.
zoish
Zoish is a Python package that streamlines machine learning by leveraging SHAP values for feature selection and interpretability, making model development more efficient and user-friendly
autofeat
Linear Prediction Model with Automated Feature Engineering and Selection Capabilities
https://github.com/abhayspawar/featexp
Feature exploration for supervised learning
https://github.com/apache/hamilton
Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.
https://github.com/adosar/moxel
Python package for parallel calculation of energy voxels.
ballet
☀️🦶 A lightweight framework for collaborative, open-source feature engineering
upgini
Data search & enrichment library for Machine Learning → Easily find and add relevant features to your ML & AI pipeline from hundreds of public and premium external data sources, including open & commercial LLMs
the-building-data-genome-project
A collection of non-residential buildings for performance analysis and algorithm benchmarking
https://github.com/csinva/disentangled-attribution-curves
Using / reproducing DAC from the paper "Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees"
https://github.com/habedi/feature-factory
A feature engineering library for Rust 🦀 with Python bindings 🐍
https://github.com/gperdrizet/ensembleset
Ensemble dataset generator for tabular data prediction and modeling projects.
https://github.com/predict-idlab/tsflex
Flexible time series feature extraction & processing
https://github.com/functime-org/functime
Time-series machine learning at scale. Built with Polars for embarrassingly parallel feature extraction and forecasts on panel data.
https://github.com/alibaba/alink
Alink is the Machine Learning algorithm platform based on Flink, developed by the PAI team of Alibaba computing platform.
https://github.com/alok-ai-lab/mrep-deepinsight
Multiple Representation DeepInsight technique
https://github.com/raptor-ml/raptor
Transform your pythonic research to an artifact that engineers can deploy easily.
https://github.com/ajayarunachalam/msda
Library for multi-dimensional, multi-sensor, uni/multivariate time series data analysis, unsupervised feature selection, unsupervised deep anomaly detection, and prototype of explainable AI for anomaly detector
https://github.com/chris-santiago/tsfeast
A collection of Scikit-Learn compatible time series transformers and tools.
https://github.com/andrei-vataselu/data-science-snippets
🧰 Essential EDA and Data Cleaning Helpers for Any DataFrame. This collection of functions is designed to accelerate exploratory data analysis (EDA), quickly surface data quality issues, and offer high-level insights into the structure and content of your dataset.
https://github.com/amr-yasser226/machine-learning-for-network-intrusion-detection
A complete pipeline for network intrusion detection comparing label encoding and one‑hot encoding, with SMOTE resampling, feature selection, and ensemble modeling using scikit‑learn and XGBoost. This was phase one of our university's "CSAI 253 - Machine Learning" course.
https://github.com/atharvapathak/sales_forecasting_project
Forecasted product sales using time series models such as Holt-Winters and SARIMA, and causal methods such as regression. Evaluated model performance using forecasting metrics such as MAE, RMSE, and MAPE, and concluded that the linear regression model produced the best MAPE of the models compared.
https://github.com/amr-yasser226/intrusion-detection-kaggle
End-to-end pipeline for multi-class cyber-attack detection using per-flow network features: data profiling, deduplication, skew-correction, outlier treatment, feature engineering, imbalance handling, and tree-based modeling (XGBoost, LightGBM, CatBoost, stacking), with a final Kaggle submission scoring 0.9146 public / 0.9163 private.
https://github.com/csinva/transformation-importance
Using / reproducing TRIM from the paper "Transformation Importance with Applications to Cosmology" 🌌 (ICLR Workshop 2020)
fedora-framework
The Fedora Framework is an evolutionary feature engineering framework designed to optimize features for machine learning tasks
asaca-automatic-speech-analysis-for-cognitive-assessment
An automatic system that extracts PRAAT-like speech features from raw speech WAV files and, at the same time, produces high-quality transcriptions with low WER (<10).
https://github.com/desbordante/desbordante-core
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also supports running data cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.
behavioralproject
Classifies TD vs. ASD according to the SRS behavioral report severity score. The ABIDE II dataset is used for training and testing. FreeSurfer v6 is used for sMRI volume preprocessing and feature extraction.
asaca-automatic-speech-analysis-for-cognitive-assessment
Transform speech into cognitive assessments with ASACA. Achieve accurate predictions and low error rates using our end-to-end toolkit. 🚀🔧
breast_cancer_diagnosis_ml
This project demonstrates the use of machine learning models to predict breast cancer diagnoses. The repository covers the entire workflow from data preprocessing and feature engineering to model training and evaluation, providing insights into diagnosis prediction with various ML models.
context-engineering
Explore cutting-edge research in context engineering with insights from top institutions. Enhance AI performance with practical techniques. 🌟📂
https://github.com/dmdequin/airbnb_price_predict
Machine Learning and NLP to predict Airbnb prices
https://github.com/data-prompt-query/dpq
dpq is an open-source Python library that makes prompt-based data transformations and feature engineering easy
learning-representations-causal-inference
Code supplement for "Neuroevolutionary representations for learning heterogeneous treatment effects"