simpleml

Machine learning that just works, for effortless production applications

https://github.com/eyadgaran/simpleml

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 6 committers (16.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

Machine learning that just works, for effortless production applications

Basic Info
  • Host: GitHub
  • Owner: eyadgaran
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 980 KB
Statistics
  • Stars: 17
  • Watchers: 2
  • Forks: 5
  • Open Issues: 53
  • Releases: 19
Created about 8 years ago · Last pushed almost 3 years ago
Metadata Files
Readme Changelog Funding License Citation

README.md


SimpleML

Machine learning that just works, for effortless production applications

Documentation: simpleml.readthedocs.io

Installation: pip install simpleml

History

SimpleML started as a persistence solution to simplify some of the most common pain points in new modeling projects. It offered an abstraction layer to implicitly version, persist, and load training iterations, making productionalizing a project an effortless process. Extensibility is central to the design, so it is compatible out of the box with modeling libraries (Scikit-Learn, TensorFlow/Keras, etc.) and algorithms, making it a low-overhead, drop-in complement to existing workflows.

As the ML ops space has grown, more solutions are being offered to manage particular pain points or to conform to opinionated development and release patterns. These patterns, while immensely powerful, are rigid and not always the ideal fit for a project (and definitely not amenable to blending multiple frameworks in one project). SimpleML is growing to address this gap by evolving from a persistence framework into an ML management framework. The goal is to unify existing and new solutions into a standardized ecosystem, giving developers the ease and flexibility to choose the right fit(s) for their projects. As before, SimpleML does not and will not define modeling algorithms; instead it focuses on perfecting the glue that allows those algorithms, and now solutions, to be used in real workflows that can be effortlessly deployed into real applications.

Architecture

Architecturally, SimpleML has a core set of components that map to the areas of ML management. Each of these is in turn extended and refined to support external libraries, tools, and infrastructure. Extensibility is the cornerstone of SimpleML, and supporting new extensions should be a simple, straightforward process that never requires monkey-patching.

Components:
  • Persistables: Standardization
  • Executors: Portability, Scale
  • Adapters: Interoperability
  • ORM: Versioning, Lineage, Metadata Tracking, Reusability
  • Save Patterns: Persistence
  • Registries: Extensibility

SimpleML core acts as the glue code binding all of the components into a seamless workflow, but all of the components can also be used independently to leverage portions of the abstraction. See the docs/source code for detailed instructions on each component or file a request for examples for a particular implementation.

Persistables

Persistables are the wrappers around artifacts (the actual objects generated by training that need to be deployed into production). They provide a standardized interface to manage and use artifacts, making it easy to use artifacts from different libraries inside the same processing environment. They also provide a unified mapping of the idiosyncrasies that come with different frameworks, so developers and scripts only use a single access pattern (e.g. always call "fit" instead of switching between fit, train, etc. based on the library). See the source code for the inheritance pattern and examples of extending around any external library.
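The "single access pattern" idea can be sketched in plain Python. This is a hypothetical illustration, not SimpleML's actual class hierarchy: a wrapper exposes one standardized `fit` entrypoint and dispatches to whatever training method the wrapped library object natively provides.

```python
# Hypothetical sketch of a persistable-style wrapper (names are illustrative,
# not SimpleML's real API).

class PersistableModel:
    # known training-method conventions across libraries
    TRAIN_METHODS = ("fit", "train", "learn")

    def __init__(self, external_model):
        self.external_model = external_model

    def fit(self, *args, **kwargs):
        # dispatch to whichever training method the wrapped object exposes
        for name in self.TRAIN_METHODS:
            method = getattr(self.external_model, name, None)
            if callable(method):
                return method(*args, **kwargs)
        raise AttributeError("wrapped object has no recognized training method")


# two stand-ins for libraries with different conventions
class SklearnStyle:
    def fit(self, X):
        self.fitted = True
        return self

class KerasStyle:
    def train(self, X):
        self.trained = True
        return self
```

With a wrapper like this, calling code always uses `PersistableModel(obj).fit(data)` regardless of which library produced `obj`.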

Executors

Executors are the persistable-agnostic components that provide portability and scale. They handle execution so the same functions can be run on various backends without affecting the artifacts produced (examples: single-process execution, multiprocessing, threading, containers, kubernetes, dask, ray, apache-beam, spark, etc.). This intentional decoupling is a large part of what powers the diverse support for flexible productionalization (train once, deploy anywhere). Note that not every execution pattern is guaranteed to work natively with every persistable (these will be noted as needed).
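The decoupling can be illustrated with a toy executor interface (hypothetical names, not SimpleML's API): the same task produces identical results whether run serially or in a thread pool.

```python
# Hypothetical sketch: executors decide *where* work runs; the task and its
# output stay the same across backends.
from concurrent.futures import ThreadPoolExecutor

class LocalExecutor:
    """Runs tasks sequentially in the current process."""
    def run(self, fn, items):
        return [fn(item) for item in items]

class ThreadedExecutor:
    """Runs tasks in a thread pool; map preserves input order."""
    def __init__(self, workers=4):
        self.workers = workers

    def run(self, fn, items):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, items))

def square(x):
    return x * x
```

Swapping `LocalExecutor` for `ThreadedExecutor` changes only where the work happens, never the result, which is the property that makes "train once, deploy anywhere" tractable.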

Adapters

Adapters are the complements to persistables and executors. They are optional wrappers that align input requirements to operations. By definition, adapters are stateless wrappers with no functional impact on processing, so they can be specified at runtime as needed. Additionally, the output of the same operation across different executors is guaranteed to be identical. (e.g. creating a docker container for a persistable to run in kubernetes, or wrapping a persistable in a ParDo to execute in apache-beam)
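A stateless adapter can be sketched as a thin callable wrapper. The names below are illustrative assumptions, not SimpleML classes: the adapter reshapes the input (dict records, as a beam-style ParDo would stream them) without changing the computed result.

```python
# Hypothetical sketch: a stateless adapter aligning input format to an
# operation. It holds no state beyond the wrapped callable, so it can be
# specified at runtime with no functional impact on the output.

class DictToArgsAdapter:
    def __init__(self, fn):
        self.fn = fn  # the only thing the adapter holds

    def __call__(self, record):
        # unpack a dict record into the keyword args the operation expects
        return self.fn(**record)

def score(features, weight):
    return sum(features) * weight

adapted = DictToArgsAdapter(score)
```

Because the adapter carries no state, wrapping and unwrapping is free to happen per backend without affecting reproducibility.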

ORM

The ORM layer is the heart of metadata management. All persistables are integrated with the database to record specifications for reproducibility, lineage, and versioning. Depending on the workflows, that metadata can also be leveraged for reusability to accelerate development iterations by only creating new experiments and reusing old persistables for existing ones.
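The kind of metadata the ORM records can be illustrated with an in-memory stand-in. SimpleML uses a real database behind an ORM; this sketch only mirrors the concepts of autoincrementing versions, lineage, and spec-matching for reuse.

```python
# Toy metadata store (illustrative only): records kind, name, params,
# an autoincrementing version per (kind, name), and lineage.

class MetadataStore:
    def __init__(self):
        self.records = []

    def register(self, kind, name, params, dependencies=()):
        # versions autoincrement independently per (kind, name)
        version = 1 + sum(
            1 for r in self.records if (r["kind"], r["name"]) == (kind, name)
        )
        record = {
            "kind": kind,
            "name": name,
            "version": version,
            "params": params,
            "lineage": list(dependencies),
        }
        self.records.append(record)
        return record

    def find(self, kind, name, params):
        # reuse an existing persistable when its specification matches
        for r in reversed(self.records):
            if (r["kind"], r["name"], r["params"]) == (kind, name, params):
                return r
        return None

store = MetadataStore()
d1 = store.register("dataset", "titanic", {"format": "csv"})
p1 = store.register(
    "pipeline", "titanic", {"train_size": 0.8},
    dependencies=[("dataset", "titanic", d1["version"])],
)
d2 = store.register("dataset", "titanic", {"format": "csv"})
```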

Save Patterns

Save and load patterns are the mechanism that manages persistence. Artifacts can differ in the native or special serialization handling required to save the training state for loading into a production environment. Save patterns allow that customization: any serialization/deserialization technique can be registered and will automatically be applied by the persistables. (examples: pickle, hickle, hdf5, json, library native, database tables, etc.)
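The registration idea can be sketched generically (hypothetical function names; SimpleML's real registration API lives in its registries):

```python
# Toy save-pattern registry: named serialize/deserialize pairs that a
# persistable could select from at save/load time.
import json
import pickle

SAVE_PATTERNS = {}

def register_save_pattern(name, serialize, deserialize):
    SAVE_PATTERNS[name] = (serialize, deserialize)

# two common techniques; others (hdf5, database tables, ...) would follow
register_save_pattern("pickle", pickle.dumps, pickle.loads)
register_save_pattern(
    "json",
    lambda obj: json.dumps(obj).encode("utf-8"),
    lambda data: json.loads(data.decode("utf-8")),
)

def save(artifact, pattern):
    serialize, _ = SAVE_PATTERNS[pattern]
    return serialize(artifact)

def load(payload, pattern):
    _, deserialize = SAVE_PATTERNS[pattern]
    return deserialize(payload)
```

Any technique registered this way round-trips transparently, which is why the document's `save()` calls can default to pickle while remaining swappable.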

Registries

Registries are the communication backend that allows users to change internal behavior or extend support at runtime. Registration can happen implicitly on import or explicitly as part of a script. (e.g. register a serialization class for a save pattern, or map an executor class to a particular backend parameter)
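A minimal registry with decorator-based (import-time) registration might look like this sketch (illustrative names only, not SimpleML's registry API):

```python
# Toy registry: maps a backend parameter string to an implementation class,
# so behavior can be extended at runtime without monkey-patching.

EXECUTOR_REGISTRY = {}

def register_executor(backend):
    """Class decorator: registration happens implicitly on import."""
    def decorator(cls):
        EXECUTOR_REGISTRY[backend] = cls
        return cls
    return decorator

@register_executor("local")
class LocalBackend:
    def run(self, fn, arg):
        return fn(arg)

def get_executor(backend):
    # explicit lookup by the runtime parameter
    return EXECUTOR_REGISTRY[backend]()
```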

Workflows

Workflows are largely up to individual developers, but there are some assumptions made about the process:

The primary assumption is that the ML lifecycle follows a DAG, which creates a forward-propagating dependency chain that never alters previous pieces of the chain. There is considerable flexibility in what each step can be, but steps are generally assumed to flow modularly and mimic a data science project.

Thematic steps, in sequence, start with data management, move through transformation, model creation, and finally evaluation. These are further broken down in the following ways:

Data Management
  • Raw Datasets: The basic data block of (potentially) unformatted datasets. These datasets can be sourced from anywhere.
  • Dataset Pipelines: The transformations required to turn unformatted data into what is expected to be seen in production. These pipelines are completely optional and only used in derived datasets.
  • Datasets: The "production formatted" datasets.

Transformation
  • Pipelines: Transformation sequences to extract and process the dataset.

Modeling
  • Models: The machine learning models.

Evaluation
  • Metrics: Evaluation objects computed over the models and datasets.
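The forward-propagating chain can be mimicked with plain functions. This is a toy numeric example, not SimpleML code: each step consumes only the outputs of earlier steps and never mutates them.

```python
# Toy DAG: dataset -> pipeline -> model -> metric, forward-only.

def build_dataset(raw_rows):
    # data management: format raw (x, label) rows
    return [{"x": r[0], "label": r[1]} for r in raw_rows]

def fit_pipeline(dataset):
    # transformation: learn a scaling factor from the dataset
    return {"scale": 1.0 / max(row["x"] for row in dataset)}

def fit_model(dataset, pipeline):
    # modeling: trivial mean-threshold "model" on the scaled feature
    scaled = [row["x"] * pipeline["scale"] for row in dataset]
    return {"threshold": sum(scaled) / len(scaled), "scale": pipeline["scale"]}

def score_metric(dataset, model):
    # evaluation: accuracy of the threshold rule
    correct = sum(
        ((row["x"] * model["scale"]) > model["threshold"]) == bool(row["label"])
        for row in dataset
    )
    return correct / len(dataset)

dataset = build_dataset([(1, 0), (2, 0), (8, 1), (10, 1)])
pipeline = fit_pipeline(dataset)
model = fit_model(dataset, pipeline)
accuracy = score_metric(dataset, model)
```

Notice that rerunning any step only appends a new result; nothing upstream is modified, which is what makes versioning and reuse safe.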

Examples

Examples are posted in response to requests under Examples. Please open an issue on GitHub to request more examples and tag it with [Example Request] How to...

Usage

Starting a project is as simple as defining the raw data and guiding the transformations. A minimal example using the Kaggle Titanic dataset is demonstrated below.

The first step in every project is to establish a database connection to manage metadata. Technically this step is only necessary if a persistable is saved or loaded, so ephemeral setups can skip this.

```python
from simpleml.utils import Database

# Initialize database connection and upgrade the schema to the latest.
# By default this will create a new local sqlite database.
# The upgrade parameter is only necessary if the db is outdated or new.
db = Database().initialize(upgrade=True)
```

The most direct way to use SimpleML is to treat it like other modeling frameworks, with forward-moving imperative actions (e.g. initialize, run methods, save). Notice how this workflow is identical to using the underlying libraries directly, with a few additional parameters; that is because SimpleML wraps the underlying libraries and standardizes their interfaces.

This block (or any subset) can be executed as many times as desired and will create a new object each time with an autoincrementing version (for each "name").

```python
from simpleml.constants import TEST_SPLIT
from simpleml.datasets.pandas import PandasFileBasedDataset
from simpleml.metrics import AccuracyMetric
from simpleml.models import SklearnLogisticRegression
from simpleml.pipelines.sklearn import RandomSplitSklearnPipeline
from simpleml.transformers import (
    DataframeToRecords,
    FillWithValue,
    SklearnDictVectorizer,
)

# Create Dataset and save it
dataset = PandasFileBasedDataset(
    name="titanic",
    filepath="filepath/to/train.csv",
    format="csv",
    label_columns=["Survived"],
    squeeze_return=True,
)
dataset.build_dataframe()
dataset.save()  # this defaults to a pickle serialization

# Define the minimal transformers to fill nulls and one-hot encode text columns
transformers = [
    ("fill_zeros", FillWithValue(values=0.0)),
    ("record_converter", DataframeToRecords()),
    ("vectorizer", SklearnDictVectorizer()),
]

# Create Pipeline and save it - use a basic 80-20 test split.
# Creates an sklearn.pipeline.Pipeline artifact
pipeline = RandomSplitSklearnPipeline(
    name="titanic",
    transformers=transformers,
    train_size=0.8,
    validation_size=0.0,
    test_size=0.2,
)
pipeline.add_dataset(dataset)  # adds a lineage relationship
pipeline.fit()  # automatically uses relationship and parameters to choose data
pipeline.save()  # this defaults to a pickle serialization

# Create Model and save it -- creates an sklearn.linear_model.LogisticRegression artifact
model = SklearnLogisticRegression(name="titanic")
model.add_pipeline(pipeline)  # adds a lineage relationship
model.fit()  # automatically uses relationship to choose data
model.save()  # this defaults to a pickle serialization

# Create Metric and save it
metric = AccuracyMetric(dataset_split=TEST_SPLIT)
metric.add_model(model)
metric.add_dataset(dataset)
metric.score()
metric.save()
```

The same operations can also be defined declaratively using wrapper utilities, so only the parameters need to be specified. Additionally, when using a deterministic persistable wrapper (an object fully initialized at construction and not subject to user changes), the automatically generated metadata can be used to identify existing artifacts without having to recreate them.

```python
from simpleml.utils import DatasetCreator, MetricCreator, ModelCreator, PipelineCreator

# ----------------------------------------------------------------------------
# Option 1: Explicit object creation (pass in dependencies)
# ----------------------------------------------------------------------------

# Object defining parameters
dataset_kwargs = {
    "name": "titanic",
    "registered_name": "PandasFileBasedDataset",
    "filepath": "filepath/to/train.csv",
    "format": "csv",
    "label_columns": ["Survived"],
    "squeeze_return": True,
}
pipeline_kwargs = {
    "name": "titanic",
    "registered_name": "RandomSplitSklearnPipeline",
    "transformers": transformers,
    "train_size": 0.8,
    "validation_size": 0.0,
    "test_size": 0.2,
}
model_kwargs = {"name": "titanic", "registered_name": "SklearnLogisticRegression"}
metric_kwargs = {"registered_name": "AccuracyMetric", "dataset_split": TEST_SPLIT}

# Each creator has two methods - retrieve_or_create and create. Using create
# will create a new persistable each time while retrieve_or_create will first
# look for a matching persistable
dataset = DatasetCreator.retrieve_or_create(**dataset_kwargs)
pipeline = PipelineCreator.retrieve_or_create(dataset=dataset, **pipeline_kwargs)
model = ModelCreator.retrieve_or_create(pipeline=pipeline, **model_kwargs)
metric = MetricCreator.retrieve_or_create(model=model, dataset=dataset, **metric_kwargs)

# ----------------------------------------------------------------------------
# Option 2: Implicit object creation (pass in dependency references - nested)
# Does not require dependency existence at this time; good for compiling job
# definitions and executing on remote, distributed nodes
# ----------------------------------------------------------------------------

# Nested dependencies
pipeline_kwargs["dataset_kwargs"] = dataset_kwargs
model_kwargs["pipeline_kwargs"] = pipeline_kwargs
metric_kwargs["model_kwargs"] = model_kwargs

dataset = DatasetCreator.retrieve_or_create(**dataset_kwargs)
pipeline = PipelineCreator.retrieve_or_create(
    dataset_kwargs=dataset_kwargs, **pipeline_kwargs
)
model = ModelCreator.retrieve_or_create(pipeline_kwargs=pipeline_kwargs, **model_kwargs)
metric = MetricCreator.retrieve_or_create(
    model_kwargs=model_kwargs, dataset_kwargs=dataset_kwargs, **metric_kwargs
)
```

This workflow is modeled as a DAG, which leaves room for parallelization, but dependencies are assumed to exist at execution time. Persistable creators are intentionally designed to take a dependency either as an object or as a reference. This allows jobs to be defined before their dependencies exist, with lazy loading when they are required, at the cost of additional computation: to match a reference to its dependency, a dummy persistable must be created and compared. The exception is a unique reference (like name and version), which means the dependency already exists and can simply be loaded later, making it memory efficient. This form also enables config files to parameterize training instead of requiring an active shell to interactively define the objects.
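The unique-reference case can be sketched with a lazy handle (hypothetical names; `fake_loader` stands in for a real persistable loader): the job definition holds only the reference, and the load happens on first use.

```python
# Toy lazy reference: a job can be defined before its dependency is needed,
# and the dependency is loaded only when resolved.

LOAD_CALLS = []  # records when the (stand-in) loader actually runs

def fake_loader(name, version):
    LOAD_CALLS.append((name, version))
    return {"name": name, "version": version}

class Ref:
    """Unique reference (name + version) to a dependency, loaded lazily."""
    def __init__(self, name, version, loader):
        self.name, self.version, self.loader = name, version, loader
        self._obj = None

    def resolve(self):
        if self._obj is None:  # load once, on first use
            self._obj = self.loader(self.name, self.version)
        return self._obj

# job definition: no load has happened yet
job = {"model": Ref("titanic", 10, fake_loader)}
```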

Once artifacts have been created, they can be easily retrieved by their name attribute (or any other identifying metadata). By default the latest version for the supplied parameters will be returned, but this can be overridden by explicitly passing a version number. This makes productionalization as simple as defining a deployment harness to process new requests.

```python
from simpleml.utils import PersistableLoader

# Notice versions are not shared between persistable types and can increment
# differently depending on iterations
dataset = PersistableLoader.load_dataset(name="titanic", version=7)
pipeline = PersistableLoader.load_pipeline(name="titanic", version=6)
model = PersistableLoader.load_model(name="titanic", version=8)
metric = PersistableLoader.load_metric(
    name="classification_accuracy", model_id=model.id
)
```

When it comes to production, the training data is typically no longer needed, so serving becomes as simple as loading the feature pipeline and the model:

```python
desired_model = PersistableLoader.load_model(name="titanic", version=10)

# Implicitly pass new data through the linked pipeline via the transform param
desired_model.predict_proba(new_dataframe, transform=True)
```

or explicitly load a pipeline to use (by default, the pipeline the model was trained on is used):

```python
desired_pipeline = PersistableLoader.load_pipeline(name="titanic", version=11)
desired_model = PersistableLoader.load_model(name="titanic", version=10)
desired_model.predict_proba(desired_pipeline.transform(new_dataframe), transform=False)
```

The Vision

Ultimately, SimpleML should fill the void many data scientists currently face with a simple and painless management layer. Furthermore, it will be extended in ways that lower the technical barrier for all developers to use machine learning in their projects. If this resonates with you, consider opening a PR and contributing!

Future features I would like to introduce:
  • Browser GUI with drag-n-drop components for each step in the process (click on a dataset, pile transformers as blocks, click on a model type...)
  • App-Store style tabs for community shared persistables (datasets, transformers...)
  • Automatic API hosting for models (click "deploy" for a REST API)

Support

SimpleML is a community project, developed on the side, to address a lot of the pain points I have felt creating ML applications. If you find it helpful and would like to support further development, please consider becoming a Sponsor :heart: or opening a PR.

Contract & Technical Support

For support implementing and extending SimpleML or architecting a machine learning tech stack, contact the author Elisha Yadgaran :email: for rates.

Enterprise Support

There is a vision to eventually offer a managed enterprise version, but that is not being pursued at the moment. Regardless of that outcome, SimpleML will always stay an open source framework and offer a self-hosted version.

Owner

  • Name: Elisha Yadgaran
  • Login: eyadgaran
  • Kind: user
  • Company: Palo Alto Networks

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite it using these metadata.
title: SimpleML
abstract: An ML Management framework for effortless production applications
authors:
  - family-names: Yadgaran
    given-names: Elisha
    orcid: "https://orcid.org/0000-0002-0444-6451"
version: 0.11.0
date-released: "2021-10-10"
identifiers:
  - type: url
    value: "https://github.com/eyadgaran/SimpleML/releases/tag/0.10.0"
    description: "The GitHub release URL of tag 0.10.0."
  - type: url
    value: "https://github.com/eyadgaran/SimpleML/releases/tag/0.11.0"
    description: "The GitHub release URL of tag 0.11.0."
license: BSD-3-Clause
license-url: "https://github.com/eyadgaran/SimpleML/blob/master/LICENSE"
repository-code: "https://github.com/eyadgaran/SimpleML"

GitHub Events

Total
Last Year

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 343
  • Total Committers: 6
  • Avg Commits per committer: 57.167
  • Development Distribution Score (DDS): 0.19
Top Committers
Name Email Commits
Elisha Yadgaran E****Y@a****u 278
Elisha Yadgaran e****n@u****m 59
dependabot[bot] 4****]@u****m 3
Pamela Toman p****n@u****m 1
aolopez l****o@p****m 1
ptoman-pa 9****a@u****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 38
  • Total pull requests: 62
  • Average time to close issues: 7 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 1
  • Total pull request authors: 4
  • Average comments per issue: 0.21
  • Average comments per pull request: 0.97
  • Merged pull requests: 33
  • Bot issues: 0
  • Bot pull requests: 25
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • eyadgaran (38)
Pull Request Authors
  • eyadgaran (35)
  • dependabot[bot] (25)
  • ptoman-pa (1)
  • aolopez (1)
Top Labels
Issue Labels
help wanted (4) bug (3) Core (3) good first issue (2) External Wrappers (2) DB Layer (1)
Pull Request Labels
dependencies (25)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 468 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 20
  • Total maintainers: 1
pypi.org: simpleml

Simplified Machine Learning

  • Versions: 20
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 468 Last month
Rankings
Downloads: 9.9%
Dependent packages count: 10.1%
Average: 14.0%
Forks count: 14.2%
Stargazers count: 14.3%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 7 months ago

Dependencies

docs/requirements.txt pypi
  • Babel ==2.9.1
  • HeapDict ==1.0.1
  • Jinja2 ==3.0.3
  • Keras-Preprocessing ==1.1.2
  • Mako ==1.1.6
  • Markdown ==3.3.6
  • MarkupSafe ==2.0.1
  • Pillow ==9.0.0
  • PyNaCl ==1.5.0
  • PyYAML ==6.0
  • Pygments ==2.11.2
  • SQLAlchemy ==1.4.31
  • Sphinx ==4.4.0
  • Unidecode ==1.3.2
  • Werkzeug ==2.0.2
  • absl-py ==1.0.0
  • alabaster ==0.7.12
  • alembic ==1.7.5
  • apache-libcloud ==3.4.1
  • astroid ==2.9.3
  • astunparse ==1.6.3
  • bcrypt ==3.2.0
  • bokeh ==2.4.2
  • cachetools ==5.0.0
  • certifi ==2021.10.8
  • cffi ==1.15.0
  • charset-normalizer ==2.0.10
  • click ==8.0.3
  • cloudpickle ==2.0.0
  • configparser ==5.2.0
  • cryptography ==36.0.1
  • dask ==2022.1.0
  • dill ==0.3.4
  • distributed ==2022.1.0
  • docutils ==0.17.1
  • flatbuffers ==2.0
  • fsspec ==2022.1.0
  • future ==0.18.2
  • gast ==0.4.0
  • google-auth ==2.4.1
  • google-auth-oauthlib ==0.4.6
  • google-pasta ==0.2.0
  • greenlet ==1.1.2
  • grpcio ==1.43.0
  • h5py ==2.10.0
  • hickle ==4.0.4
  • idna ==3.3
  • imagesize ==1.3.0
  • importlib-metadata ==4.10.1
  • importlib-resources ==5.4.0
  • joblib ==1.1.0
  • keras ==2.7.0
  • lazy-object-proxy ==1.7.1
  • libclang ==12.0.0
  • locket ==0.2.1
  • msgpack ==1.0.3
  • numpy ==1.21.5
  • oauthlib ==3.1.1
  • onedrivesdk ==1.1.8
  • opt-einsum ==3.3.0
  • packaging ==21.3
  • pandas ==1.3.5
  • paramiko ==2.9.2
  • partd ==1.2.0
  • protobuf ==3.19.3
  • psutil ==5.9.0
  • psycopg2 ==2.9.3
  • pyarrow ==6.0.1
  • pyasn1 ==0.4.8
  • pyasn1-modules ==0.2.8
  • pycparser ==2.21
  • pycrypto ==2.6.1
  • pyparsing ==3.0.7
  • python-dateutil ==2.8.2
  • pytz ==2021.3
  • requests ==2.27.1
  • requests-oauthlib ==1.3.0
  • rsa ==4.8
  • scikit-learn ==1.0.2
  • scipy ==1.7.3
  • simplejson ==3.17.6
  • six ==1.16.0
  • snowballstemmer ==2.2.0
  • sortedcontainers ==2.4.0
  • sphinx-autoapi ==1.8.4
  • sphinx-rtd-theme ==1.0.0
  • sphinxcontrib-applehelp ==1.0.2
  • sphinxcontrib-devhelp ==1.0.2
  • sphinxcontrib-htmlhelp ==2.0.0
  • sphinxcontrib-jsmath ==1.0.1
  • sphinxcontrib-qthelp ==1.0.3
  • sphinxcontrib-serializinghtml ==1.1.5
  • sqlalchemy-json ==0.4.0
  • sqlalchemy-mixins ==1.5
  • sshtunnel ==0.4.0
  • tblib ==1.7.0
  • tensorboard ==2.8.0
  • tensorboard-data-server ==0.6.1
  • tensorboard-plugin-wit ==1.8.1
  • tensorflow ==2.7.0
  • tensorflow-estimator ==2.7.0
  • tensorflow-io-gcs-filesystem ==0.23.1
  • termcolor ==1.1.0
  • threadpoolctl ==3.0.0
  • toolz ==0.11.2
  • tornado ==6.1
  • typed-ast ==1.5.2
  • typing_extensions ==4.0.1
  • urllib3 ==1.26.8
  • wrapt ==1.13.3
  • zict ==2.0.0
  • zipp ==3.7.0
setup.py pypi
  • alembic *
  • click *
  • cloudpickle *
  • configparser *
  • future *
  • numpy *
  • simplejson *
  • sqlalchemy >=1.3.7
  • sqlalchemy-json *
  • sqlalchemy-mixins *
.github/workflows/lint.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/package.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/test.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • postgres * docker
.github/workflows/type-check.yml actions
  • actions/checkout v1 composite
  • actions/setup-python v1 composite