https://github.com/habedi/feature-factory

A feature engineering library for Rust 🦀 with Python bindings 🐍


Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ○ CITATION.cff file
  • ✓ codemeta.json file: Found codemeta.json file
  • ✓ .zenodo.json file: Found .zenodo.json file
  • ○ DOI references
  • ○ Academic publication links
  • ○ Committers with academic emails
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity: Low similarity (13.1%) to scientific vocabulary

Keywords

data-preprocessing data-science feature-engineering feature-selection machine-learning python python-library rust rust-library
Last synced: 5 months ago

Repository

A feature engineering library for Rust 🦀 with Python bindings 🐍

Basic Info
  • Host: GitHub
  • Owner: habedi
  • License: apache-2.0
  • Language: Rust
  • Default Branch: main
  • Homepage:
  • Size: 145 KB
Statistics
  • Stars: 16
  • Watchers: 1
  • Forks: 0
  • Open Issues: 3
  • Releases: 0
Topics
data-preprocessing data-science feature-engineering feature-selection machine-learning python python-library rust rust-library
Created 12 months ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Code of conduct

README.md


Feature Factory


Feature Factory is a feature engineering library for Rust built on top of Apache DataFusion. It uses DataFusion internally for fast, in-memory data processing. It is inspired by the Feature-engine Python library and provides a wide range of components (referred to as transformers) for common feature engineering tasks like imputation, encoding, discretization, and feature selection.

Feature Factory aims to be feature-rich and provide an API similar to Scikit-learn, with the performance benefits of Rust and Apache DataFusion. Feature Factory transformers follow a fit-transform paradigm, where each transformer provides a constructor, a fit method, and a transform method. Given an input dataframe, a transformer applies a transformation to the data and returns a new dataframe. The library also provides a pipeline API that allows users to chain multiple transformers together to create data transformation pipelines for feature engineering.

[!IMPORTANT] Feature Factory is at an early stage of development. APIs are unstable and may change without notice. Inconsistencies in documentation are expected, and not all features have been implemented yet. The library has not yet been thoroughly tested, benchmarked, or optimized for performance. Bug reports, feature requests, and contributions are welcome!

Features

  • High Performance: Feature Factory uses Apache DataFusion as the backend data processing engine.
  • Scikit-learn API: It provides a Scikit-learn-like API which is familiar to most data scientists.
  • Pipeline API: Users can chain multiple transformers together to build a feature engineering pipeline.
  • Large Set of Transformers: Currently, Feature Factory includes the following transformers:

| Task | Transformers | Status |
|------|--------------|--------|
| Imputation | - MeanMedianImputer: Replace missing values with the mean (or median).<br>- ArbitraryNumberImputer: Replace missing values with an arbitrary number.<br>- EndTailImputer: Replace missing values with values at distribution tails.<br>- CategoricalImputer: Replace missing values with an arbitrary string or most frequent category.<br>- AddMissingIndicator: Add a binary indicator for missing values.<br>- DropMissingData: Remove rows with missing values. | Tested |
| Categorical Encoding | - OneHotEncoder: Perform one-hot encoding.<br>- CountFrequencyEncoder: Replace categories with their frequencies.<br>- OrdinalEncoder: Replace categories with ordered numbers.<br>- MeanEncoder: Replace categories with target mean.<br>- WoEEncoder: Replace categories with the weight of evidence.<br>- RareLabelEncoder: Group infrequent categories. | Tested |
| Variable Discretization | - ArbitraryDiscretizer: Discretize based on user-defined intervals.<br>- EqualFrequencyDiscretizer: Discretize into equal-frequency bins.<br>- EqualWidthDiscretizer: Discretize into equal-width bins.<br>- GeometricWidthDiscretizer: Discretize into geometric intervals. | Tested |
| Outlier Handling | - ArbitraryOutlierCapper: Cap outliers at user-defined bounds.<br>- Winsorizer: Cap outliers using percentile thresholds.<br>- OutlierTrimmer: Remove outliers from the dataset. | Tested |
| Numerical Transformations | - LogTransformer: Apply logarithmic transformation.<br>- LogCpTransformer: Apply log transformation with a constant.<br>- ReciprocalTransformer: Apply reciprocal transformation.<br>- PowerTransformer: Apply power transformation.<br>- BoxCoxTransformer: Apply Box-Cox transformation.<br>- YeoJohnsonTransformer: Apply Yeo-Johnson transformation.<br>- ArcsinTransformer: Apply arcsin transformation. | Tested |
| Feature Creation | - MathFeatures: Create new features with mathematical operations.<br>- RelativeFeatures: Combine features with reference features.<br>- CyclicalFeatures: Encode cyclical features using sine or cosine. | Tested |
| Datetime Features | - DatetimeFeatures: Extract features from datetime values.<br>- DatetimeSubtraction: Compute time differences between datetime values. | Tested |
| Feature Selection | - DropFeatures: Drop specific features.<br>- DropConstantFeatures: Remove constant and quasi-constant features.<br>- DropDuplicateFeatures: Remove duplicate features.<br>- DropCorrelatedFeatures: Remove highly correlated features.<br>- SmartCorrelatedSelection: Select the best features from correlated groups.<br>- DropHighPSIFeatures: Drop features based on Population Stability Index (PSI).<br>- SelectByInformationValue: Select features based on information value.<br>- SelectBySingleFeaturePerformance: Select features based on univariate estimators.<br>- SelectByTargetMeanPerformance: Select features based on target mean encoding.<br>- MRMR: Select features using Maximum Relevance Minimum Redundancy. | Tested |

[!NOTE] Status shows whether the module is Tested (unit, integration, and documentation tests) and Benchmarked. Empty status means the module has not yet been tested and benchmarked.
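As one concrete case from the table above, the idea behind encoding a cyclical feature with sine and cosine can be sketched in plain Rust. This is only an illustration of the technique, not the CyclicalFeatures implementation: the function name `cyclical_encode` and the explicit `period` parameter are assumptions made for this sketch.

```rust
use std::f64::consts::PI;

/// Hypothetical helper: maps a value with a known period onto the unit circle.
/// Returns `(sin, cos)`, so values on either side of the wrap-around point
/// (for example hour 23 and hour 0) end up close to each other.
fn cyclical_encode(value: f64, period: f64) -> (f64, f64) {
    let angle = 2.0 * PI * value / period;
    (angle.sin(), angle.cos())
}
```

For an hour-of-day column, `cyclical_encode(hour, 24.0)` yields two new features. Keeping both components matters because sine alone maps two different hours to the same value.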

Installation

```shell
cargo add feature-factory
```

Or add this to your Cargo.toml:

```toml
[dependencies]
feature-factory = "0.1"
```

Feature Factory requires Rust 1.83 or later.

Documentation

You can find the latest API documentation at docs.rs/feature-factory.

Architecture

The main building blocks of Feature Factory are transformers and pipelines.

Transformers

A transformer takes one or more columns from an input DataFrame and creates new columns based on a transformation. Transformers can be stateful or stateless:

  • A stateful transformer needs to learn one or more parameters from the data during training (by calling fit) before it can transform the data. A stateful transformer with learned parameters is referred to as a fitted transformer.
  • A stateless transformer can transform the data directly, without needing to learn any parameters.

All transformers implement the Transformer trait, which includes:

| Method | Description |
|--------|-------------|
| new | Creates a new transformer instance. Can accept hyperparameters and column names as input arguments. |
| fit | Learns parameters from data. For stateless transformers this is a no-op. |
| transform | Applies the transformation to data. Stateful transformers require calling fit first. |
| is_stateful | Returns true if the transformer is stateful, otherwise false. |
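The stateful/stateless distinction and the four methods listed above can be illustrated with a small, self-contained sketch. It deliberately does not use the crate's real Transformer trait or DataFusion: `SimpleTransformer`, `MeanImputer`, `LogTransform`, and the `Column` alias are hypothetical stand-ins that operate on an in-memory vector, and the actual signatures in Feature Factory may differ.

```rust
// Simplified stand-in for illustration only. The real `Transformer` trait in
// Feature Factory works on DataFusion DataFrames and may differ in signatures.

/// A single numeric column where `None` marks a missing value.
type Column = Vec<Option<f64>>;

/// Hypothetical trait mirroring the methods described in the table.
trait SimpleTransformer {
    /// Learns parameters from the data (a no-op for stateless transformers).
    fn fit(&mut self, data: &Column) -> Result<(), String>;
    /// Applies the transformation and returns a new column.
    fn transform(&self, data: &Column) -> Result<Column, String>;
    /// Reports whether `fit` must be called before `transform`.
    fn is_stateful(&self) -> bool;
}

/// Stateful example: replaces missing values with the mean learned in `fit`.
struct MeanImputer {
    mean: Option<f64>,
}

impl MeanImputer {
    fn new() -> Self {
        MeanImputer { mean: None }
    }
}

impl SimpleTransformer for MeanImputer {
    fn fit(&mut self, data: &Column) -> Result<(), String> {
        // Collect only the values that are present.
        let present: Vec<f64> = data.iter().flatten().copied().collect();
        if present.is_empty() {
            return Err("cannot compute a mean from an all-missing column".to_string());
        }
        self.mean = Some(present.iter().sum::<f64>() / present.len() as f64);
        Ok(())
    }

    fn transform(&self, data: &Column) -> Result<Column, String> {
        // Refuse to run before `fit` has stored a mean.
        let mean = self.mean.ok_or("MeanImputer must be fitted before transform")?;
        Ok(data.iter().map(|v| Some(v.unwrap_or(mean))).collect())
    }

    fn is_stateful(&self) -> bool {
        true
    }
}

/// Stateless example: applies the natural logarithm to every present value.
/// Non-positive inputs produce NaN or negative infinity in this sketch.
struct LogTransform;

impl SimpleTransformer for LogTransform {
    fn fit(&mut self, _data: &Column) -> Result<(), String> {
        Ok(()) // nothing to learn
    }

    fn transform(&self, data: &Column) -> Result<Column, String> {
        Ok(data.iter().map(|v| v.map(f64::ln)).collect())
    }

    fn is_stateful(&self) -> bool {
        false
    }
}
```

In this sketch, calling `transform` on an unfitted `MeanImputer` returns an error, while `LogTransform` can be applied immediately. Returning a `Result` rather than panicking is a design choice of the example, not a statement about the library.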

The figure below shows a high-level overview of how a single Feature Factory transformer works:

[Figure: Feature Factory Transformer]

[!IMPORTANT] In most cases, to avoid data leakage, the data used for training a transformer must not be the same as the data that is going to be transformed.
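To make the leakage point concrete, the sketch below learns equal-width bin boundaries from a training slice only and then reuses them on separate data. It is a plain-Rust illustration under stated assumptions: `EqualWidthBins`, `fit_equal_width`, and `apply_bins` are hypothetical names, not part of the Feature Factory API, and clamping out-of-range values is a choice made for this example.

```rust
/// Bin boundaries learned from training data only (hypothetical helper type).
struct EqualWidthBins {
    min: f64,
    width: f64,
    n_bins: usize,
}

/// Learns equal-width bin boundaries from the training values.
/// Returns `None` for empty input, zero bins, or a constant column.
fn fit_equal_width(train: &[f64], n_bins: usize) -> Option<EqualWidthBins> {
    if train.is_empty() || n_bins == 0 {
        return None;
    }
    let min = train.iter().copied().fold(f64::INFINITY, f64::min);
    let max = train.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    if max <= min {
        return None; // a constant column cannot be split into bins
    }
    Some(EqualWidthBins {
        min,
        width: (max - min) / n_bins as f64,
        n_bins,
    })
}

/// Assigns each value to a bin index using the previously learned boundaries.
/// Values outside the training range are clamped to the first or last bin.
fn apply_bins(bins: &EqualWidthBins, values: &[f64]) -> Vec<usize> {
    values
        .iter()
        .map(|&v| {
            let raw = ((v - bins.min) / bins.width).floor();
            if raw < 0.0 {
                0
            } else {
                (raw as usize).min(bins.n_bins - 1)
            }
        })
        .collect()
}
```

The held-out data never influences `min` or `width`. If the boundaries were recomputed on the data being transformed, information from that data would leak into the features.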

Pipelines

A pipeline chains multiple transformers together. Pipelines are created using the make_pipeline macro, which accepts a list of (name, transformer) tuples. Stateful transformers must be fitted before they're used in a pipeline.

The figure below shows a high-level overview of how a Feature Factory pipeline works:

[Figure: Feature Factory Pipeline]

[!IMPORTANT] Currently, to use a stateful transformer in a pipeline, it must be already fitted.
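The chaining behaviour described in this section can be sketched without the make_pipeline macro, whose exact syntax is not shown here. Everything in the block is a hypothetical stand-in: `Step`, `FillMissing`, `CapAt`, and `SketchPipeline` are illustrative names, and each step is assumed to be already fitted, matching the requirement above.

```rust
/// A column of numeric values where `None` marks a missing entry.
type Column = Vec<Option<f64>>;

/// Hypothetical minimal interface for an already-fitted pipeline step.
trait Step {
    fn transform(&self, data: Column) -> Result<Column, String>;
}

/// Fills missing values with a constant that was learned or chosen beforehand.
struct FillMissing {
    value: f64,
}

/// Caps values at an upper bound.
struct CapAt {
    upper: f64,
}

impl Step for FillMissing {
    fn transform(&self, data: Column) -> Result<Column, String> {
        Ok(data.into_iter().map(|v| Some(v.unwrap_or(self.value))).collect())
    }
}

impl Step for CapAt {
    fn transform(&self, data: Column) -> Result<Column, String> {
        Ok(data.into_iter().map(|v| v.map(|x| x.min(self.upper))).collect())
    }
}

/// Ordered list of named steps, analogous to (name, transformer) tuples.
struct SketchPipeline {
    steps: Vec<(String, Box<dyn Step>)>,
}

impl SketchPipeline {
    fn new() -> Self {
        SketchPipeline { steps: Vec::new() }
    }

    /// Appends a named step and returns the pipeline for chaining.
    fn add(mut self, name: &str, step: Box<dyn Step>) -> Self {
        self.steps.push((name.to_string(), step));
        self
    }

    /// Runs every step in order, feeding each output into the next step.
    fn transform(&self, mut data: Column) -> Result<Column, String> {
        for (name, step) in &self.steps {
            data = step
                .transform(data)
                .map_err(|e| format!("step '{name}' failed: {e}"))?;
        }
        Ok(data)
    }
}
```

Step names are used here only to make error messages traceable to the failing stage. Order matters: in this sketch, imputing before capping means the filled-in value is also subject to the cap.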

Examples

Check out the examples and tests directories for examples of how to use Feature Factory.

Contributing

See CONTRIBUTING.md for details on how to make a contribution.

Logo

The mascot of this project is named "Weldon the Penguin". He is a Rustacean penguin who loves to swim in the sea and play video games, and is always ready to help you with your data.

The logo was created using GIMP, ComfyUI, and a Flux Schnell v2 model.

Licensing

Feature Factory is available under the terms of either of these licenses:

Owner

  • Name: Hassan Abedi
  • Login: habedi
  • Kind: user
  • Location: Trondheim; Norway

GitHub Events

Total
  • Issues event: 3
  • Watch event: 13
  • Delete event: 4
  • Public event: 1
  • Push event: 150
  • Pull request event: 4
  • Create event: 7
Last Year
  • Issues event: 3
  • Watch event: 13
  • Delete event: 4
  • Public event: 1
  • Push event: 150
  • Pull request event: 4
  • Create event: 7

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 5
  • Total Committers: 1
  • Avg Commits per committer: 5.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 5
  • Committers: 1
  • Avg Commits per committer: 5.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Hassan Abedi h****t@g****m 5

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • habedi (3)
Pull Request Authors
  • habedi (3)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cargo 1,087 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
crates.io: feature-factory

A high-performance feature engineering library for Rust powered by Apache DataFusion

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 1,087 Total
Rankings
Dependent repos count: 22.9%
Dependent packages count: 30.3%
Average: 49.8%
Downloads: 96.2%
Maintainers (1)
Last synced: 6 months ago