sklearn-migrator: Cross-version migration of scikit-learn models for reproducible MLOps

sklearn-migrator: Cross-version migration of scikit-learn models for reproducible MLOps - Published in JOSS (2026)

https://github.com/anvaldes/sklearn-migrator

Last synced: about 2 months ago

Repository

Basic Info

Host: GitHub
Owner: anvaldes
License: MIT
Language: Python
Default Branch: dev
Size: 983 KB

Statistics

Stars: 11
Watchers: 2
Forks: 2
Open Issues: 8
Releases: 2

Created about 1 year ago · Last pushed 2 months ago

Metadata Files

Readme, Contributing, License, Code of conduct, Citation, Codeowners

sklearn-migrator 🧪

A Python library to serialize and migrate scikit-learn models across incompatible versions.

🚀 Motivation

Machine learning teams frequently store trained scikit-learn models using pickle or joblib.
However:

❌ These serialized models break when scikit-learn versions change

Internal attributes change
APIs evolve (e.g., affinity → metric)
Tree and boosting internals get reorganized
New default parameters appear

❌ This creates real problems:

Production services fail after dependency upgrades
Research becomes non-reproducible
Long-term model governance becomes impossible
Models can't be migrated or audited reliably

✅ What `sklearn-migrator` provides

✔ Serialize any supported model into a JSON-compatible dictionary

✔ Deserialize and reconstruct the model in a different scikit-learn version

✔ Remove dependency on pickle/joblib for long-term storage

✔ Enable reproducible ML pipelines across environments

This library has been validated across 1,024 version migration pairs (from → to), covering every combination of the supported scikit-learn versions from 0.21.3 to 1.7.2.

💡 Supported Models (21 models)

sklearn-migrator supports 21 core models across classification, regression, clustering, and dimensionality reduction.

📘 Classification

| Model | Supported |
|------------------------------|-----------|
| DecisionTreeClassifier | ✅ |
| RandomForestClassifier | ✅ |
| GradientBoostingClassifier | ✅ |
| LogisticRegression | ✅ |
| KNeighborsClassifier | ✅ |
| SVC (Support Vector Classifier) | ✅ |
| MLPClassifier | ✅ |

📗 Regression

| Model | Supported |
|------------------------------|-----------|
| DecisionTreeRegressor | ✅ |
| RandomForestRegressor | ✅ |
| GradientBoostingRegressor | ✅ |
| LinearRegression | ✅ |
| Ridge | ✅ |
| Lasso | ✅ |
| KNeighborsRegressor | ✅ |
| SVR (Support Vector Regressor) | ✅ |
| AdaBoostRegressor | ✅ |
| MLPRegressor | ✅ |

📙 Clustering

| Model | Supported |
|----------------------|-----------|
| KMeans | ✅ |
| MiniBatchKMeans | ✅ |
| Agglomerative | ✅ |

📘 Dimensionality Reduction

| Model | Supported |
|-------|-----------|
| PCA | ✅ |

🔢 Version Compatibility Matrix

The library supports model migrations across the full matrix:

32 versions
1,024 migration pairs
Fully tested using automated environments via CI/CD on every push

```python
versions = [
    '0.21.3', '0.22.0', '0.22.1', '0.23.0', '0.23.1', '0.23.2',
    '0.24.0', '0.24.1', '0.24.2', '1.0.0', '1.0.1', '1.0.2',
    '1.1.0', '1.1.1', '1.1.2', '1.1.3', '1.2.0', '1.2.1', '1.2.2',
    '1.3.0', '1.3.1', '1.3.2', '1.4.0', '1.4.2', '1.5.0', '1.5.1',
    '1.5.2', '1.6.0', '1.6.1', '1.7.0', '1.7.1', '1.7.2'
]
```

| From \ To | 0.21.3 | 0.22.0 | ... | 1.7.2 |
| --------- | ------ | ------ | --- | ----- |
| 0.21.3 | ✅ | ✅ | ... | ✅ |
| 0.22.0 | ✅ | ✅ | ... | ✅ |
| ... | ... | ... | ... | ... |
| 1.7.2 | ✅ | ✅ | ... | ✅ |
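The 1,024 figure follows from pairing each of the 32 versions with every version, itself included. A minimal sketch of that enumeration, using a shortened, illustrative version list (the variable and function names here are assumptions, not part of the library):

```python
from itertools import product

# Shortened, illustrative subset of the supported versions.
versions = ["0.21.3", "1.0.0", "1.5.0", "1.7.2"]


def migration_pairs(versions):
    """Return every (version_in, version_out) pair, including same-version pairs."""
    return list(product(versions, repeat=2))


pairs = migration_pairs(versions)
print(len(pairs))  # 4 versions give 16 pairs; 32 versions give 32 * 32 = 1,024
```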

📊 Validation Metric

Each migration pair (version_in, version_out) is validated using:

$$\max |y_{\text{in}} - y_{\text{out}}| < 10^{-2}$$

Where:

- $y_{\text{in}}$: predictions from the model in the source version
- $y_{\text{out}}$: predictions from the migrated model in the target version
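As an illustration, the check for a single migration pair could look like the following sketch. The function names and sample predictions are hypothetical; only the tolerance of 10^-2 comes from the metric above.

```python
def max_abs_diff(y_in, y_out):
    """Largest absolute element-wise difference between two prediction lists."""
    if len(y_in) != len(y_out):
        raise ValueError("prediction lists must have the same length")
    return max(abs(a - b) for a, b in zip(y_in, y_out))


def migration_is_valid(y_in, y_out, tol=1e-2):
    """True when max |y_in - y_out| is strictly below the tolerance."""
    return max_abs_diff(y_in, y_out) < tol


# Hypothetical predictions from the source and target environments.
y_in = [0.10, 0.25, 0.90]
y_out = [0.10, 0.25, 0.895]
print(migration_is_valid(y_in, y_out))
```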

The worst case across all 1,024 pairs is obtained via:

```python
df_performance.abs().max().max()  # global worst case (32x32 matrix)
```
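In pandas, `DataFrame.max()` reduces each column, so chaining `.max().max()` after `.abs()` yields a single scalar. A plain-Python sketch of the same reduction, assuming the matrix stores signed prediction differences per pair (the name `performance` and its values are made up for illustration):

```python
# Hypothetical signed differences, keyed as performance[version_in][version_out].
performance = {
    "1.5.0": {"1.5.0": 0.0, "1.7.0": -3e-4},
    "1.7.0": {"1.5.0": 2e-4, "1.7.0": 0.0},
}


def global_worst_case(matrix):
    """Largest absolute value over every cell of a nested-dict matrix."""
    return max(abs(value) for row in matrix.values() for value in row.values())


worst = global_worst_case(performance)
print(worst < 1e-2)  # compare against the validation threshold
```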

⚠️ All 1,024 combinations and 21 models are automatically tested on every push via CI/CD, using isolated Docker environments for each sklearn version. Each model is validated under a representative parameter configuration; exhaustive combinatorial testing of all parameter combinations is outside the current scope.

📂 Installation

```bash
pip install sklearn-migrator
```

📚 API Documentation

For full API documentation covering all 21 models, function signatures, parameters, return types, and usage examples, see API.md.

💥 Use Cases

Long-term model storage: Store models in a future-proof format across teams and systems.
Production model migration: Move models safely across major scikit-learn upgrades.
Auditing and inspection: Read serialized models as JSON, inspect structure, hyperparameters, and internals.
Cross-platform inference: Serialize in Python, serve elsewhere (e.g., microservices).
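For the auditing use case, a serialized model can be explored with ordinary JSON tooling. The sketch below walks a dictionary and lists keys with their value types. The payload is a stand-in: the actual layout produced by the library is not described here and may differ.

```python
import json


def summarize(obj, depth=0, max_depth=2):
    """Return indented lines describing the keys and value types of a JSON object."""
    lines = []
    if isinstance(obj, dict) and depth < max_depth:
        for key, value in obj.items():
            lines.append("  " * depth + f"{key}: {type(value).__name__}")
            lines.extend(summarize(value, depth + 1, max_depth))
    return lines


# Stand-in payload; the real layout produced by the library may differ.
payload = json.loads('{"params": {"n_estimators": 100}, "version": "1.5.0"}')
print("\n".join(summarize(payload)))
```

Raising `max_depth` shows deeper levels of nesting, which can help when looking for a specific hyperparameter.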

1. Using two Python environments

Serialize the model in an environment that has one scikit-learn version installed (for example, 1.5.0), then deserialize it in another environment with a different version (for example, 1.7.0).

The deserialized model belongs to the scikit-learn version of the environment where it was deserialized; in this case, 1.7.0.

Decide which scikit-learn version you are migrating from and which you are migrating to before you start, so that you can create the matching environments for serialization and deserialization.
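One way to keep the two environments explicit is to generate the install command for each from the version numbers. This is only a sketch: the interpreter paths `env_in` and `env_out` are hypothetical, and the commands are built but not executed.

```python
import sys


def pip_install_command(sklearn_version, python=sys.executable):
    """Build (but do not run) a pip command pinning one scikit-learn version."""
    return [
        python, "-m", "pip", "install",
        f"scikit-learn=={sklearn_version}",
        "sklearn-migrator",
    ]


# Example versions from the text: serialize in 1.5.0, deserialize in 1.7.0.
source_cmd = pip_install_command("1.5.0", python="env_in/bin/python")
target_cmd = pip_install_command("1.7.0", python="env_out/bin/python")
print(" ".join(source_cmd))
print(" ".join(target_cmd))
```

Once the virtual environments exist, each list can be passed to `subprocess.run(cmd, check=True)`.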

a. Serialize the model

```python
import json
import sklearn
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn_migrator.regression.random_forest_reg import serialize_random_forest_reg

# Build a small synthetic regression problem
X, y = make_regression(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train in the source environment and keep its predictions for later comparison
model = RandomForestRegressor().fit(X_train, y_train)
predictions = model.predict(X_test)

# Convert the fitted model into a JSON-compatible dictionary
data = serialize_random_forest_reg(model, sklearn.__version__)

with open("model.json", "w") as f:
    json.dump(data, f)
```

b. Deserialize the model

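```python
import json
import sklearn
from sklearn_migrator.regression.random_forest_reg import deserialize_random_forest_reg

with open("model.json") as f:
    data = json.load(f)

# Rebuild the model using the scikit-learn version installed in this environment
new_model = deserialize_random_forest_reg(data, sklearn.__version__)

# X_test must be available here too (for example, saved from the source environment)
new_predictions = new_model.predict(X_test)
```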
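The deserialization snippet calls `predict` on `X_test`, which exists only in the source environment unless it is carried over. One possible approach, sketched here with plain lists and hypothetical file and key names, is to store the test inputs and the original predictions next to the model so the target environment can reload them:

```python
import json
import os
import tempfile


def save_reference(path, X_test, predictions):
    """Write test inputs and source-version predictions as JSON."""
    with open(path, "w") as f:
        json.dump({"X_test": X_test, "predictions": predictions}, f)


def load_reference(path):
    """Read back the inputs and predictions saved by save_reference."""
    with open(path) as f:
        ref = json.load(f)
    return ref["X_test"], ref["predictions"]


# Plain lists stand in for arrays; with NumPy, call .tolist() before saving.
path = os.path.join(tempfile.mkdtemp(), "reference.json")
save_reference(path, [[0.5, -1.2], [0.3, 1.1]], [10.0, -4.5])
X_test, predictions = load_reference(path)
print(X_test, predictions)
```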

2. Docker: Step by Step

You have a Random Forest Classifier saved in .pkl format as model.pkl. The model was created with scikit-learn 1.5.0.

i. Create the following folder on your Desktop:

```bash
/test_github
```

Then copy your model.pkl into this folder.

ii. The Dockerfiles and requirements for all supported input versions are available in the integration/environments/input/ directory of this repository. Copy the files for your input version (e.g., 1.5.0):

```bash
/test_github/input/1.5.0/Dockerfile_input
/test_github/input/1.5.0/requirements_input.txt
```

iii. The Dockerfiles and requirements for all supported output versions are available in the integration/environments/output/ directory of this repository. Copy the files for your output version (e.g., 1.7.0):

```bash
/test_github/output/1.7.0/Dockerfile_output
/test_github/output/1.7.0/requirements_output.txt
```

iv. Now you create your input.py:

```python
import json
import joblib
import sklearn
import numpy as np
import pandas as pd
from joblib import load

from sklearn.ensemble import RandomForestClassifier

from sklearn_migrator.classification.random_forest_clf import serialize_random_forest_clf

version_sklearn_in = sklearn.__version__

# Load the pickled model inside the source-version container
model = load('model.pkl')

all_data = serialize_random_forest_clf(model, version_sklearn_in)

def convert(o):
    """Turn NumPy scalars and arrays into plain Python types for json.dump."""
    if isinstance(o, (np.integer, np.int64)):
        return int(o)
    elif isinstance(o, (np.floating, np.float64)):
        return float(o)
    elif isinstance(o, np.ndarray):
        return o.tolist()
    else:
        raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")

with open("input_model/all_data.json", "w") as f:
    json.dump(all_data, f, default=convert)

# Reference row used to compare predictions before and after migration
fake_row = np.array([[0.5, -1.2, 0.3, 1.1, -0.7, 0.9, 0.0, -0.3, 1.5, 0.2]])

y_pred = pd.DataFrame(model.predict_proba(fake_row))
y_pred.to_csv('input_model/y_pred.csv', index=False)
```
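`json.dump` calls the function passed as `default` only for objects it cannot encode natively, and expects back a value it can encode. A small standard-library sketch of the same pattern, using `Decimal` and `set` as stand-ins for the NumPy types handled in `input.py`:

```python
import json
from decimal import Decimal


def convert(o):
    """Fallback encoder: called by json only for otherwise unsupported objects."""
    if isinstance(o, Decimal):
        return float(o)
    if isinstance(o, (set, frozenset)):
        return sorted(o)
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")


data = {"threshold": Decimal("0.5"), "classes": {2, 0, 1}, "depth": 3}
encoded = json.dumps(data, default=convert)
print(encoded)
```

The integer under `depth` never reaches `convert`, because the encoder already knows how to write it.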

v. Now you create your output.py:

```python
import json
import joblib
import sklearn
import numpy as np
import pandas as pd
from joblib import load

from sklearn.ensemble import RandomForestClassifier

from sklearn_migrator.classification.random_forest_clf import deserialize_random_forest_clf

version_sklearn_out = sklearn.__version__

with open("input_model/all_data.json", "r") as f:
    all_data = json.load(f)

# Rebuild the model for the scikit-learn version installed in this container
new_model = deserialize_random_forest_clf(all_data, version_sklearn_out)

joblib.dump(new_model, 'output_model/new_model.pkl')

# Same reference row as in input.py, so the two prediction files are comparable
fake_row = np.array([[0.5, -1.2, 0.3, 1.1, -0.7, 0.9, 0.0, -0.3, 1.5, 0.2]])

y_pred_new = pd.DataFrame(new_model.predict_proba(fake_row))
y_pred_new.to_csv('output_model/y_pred_new.csv', index=False)
```

vi. Copy all of these files into the root of test_github/:

```bash
cp input/1.5.0/* output/1.7.0/* .
```

vii. Now you create two folders: input_model/ and output_model/.

viii. Execute the next commands in your terminal (you should be in the root of test_github/ folder):

```bash
docker build -f Dockerfile_input -t image_input_1.5.0 .
docker build -f Dockerfile_output -t image_output_1.7.0 .

docker run --rm \
  -v "$(pwd)/input_model:/app/input_model" \
  -v "$(pwd)/model.pkl:/app/model.pkl" \
  image_input_1.5.0

docker run --rm \
  -v "$(pwd)/input_model:/app/input_model" \
  -v "$(pwd)/output_model:/app/output_model" \
  image_output_1.7.0
```

ix. Finally, you can find your migrated model in the output_model/ folder under the name new_model.pkl. It is a scikit-learn model for version 1.7.0.
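To confirm the migration, the two prediction files written by input.py and output.py can be compared against the 10^-2 tolerance from the validation metric. A minimal sketch using only the standard library follows. The helper names are illustrative, and the demo writes two stand-in files to a temporary folder instead of reading the real ones.

```python
import csv
import os
import tempfile


def read_predictions(path):
    """Read a CSV written by DataFrame.to_csv(index=False) into rows of floats."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return [[float(cell) for cell in row] for row in rows[1:]]  # skip header row


def max_abs_diff(a, b):
    """Largest absolute cell-wise difference between two equally shaped tables."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def compare(path_in, path_out, tol=1e-2):
    """True when the predictions in both files agree within tol."""
    return max_abs_diff(read_predictions(path_in), read_predictions(path_out)) < tol


# Demo with stand-in files; in the walkthrough the real paths would be
# input_model/y_pred.csv and output_model/y_pred_new.csv.
tmp = tempfile.mkdtemp()
path_in = os.path.join(tmp, "y_pred.csv")
path_out = os.path.join(tmp, "y_pred_new.csv")
for path in (path_in, path_out):
    with open(path, "w") as f:
        f.write("0,1\n0.30,0.70\n")
result = compare(path_in, path_out)
print(result)
```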

🔧 Development

Run tests locally

```bash
pytest tests/
```

Integration tests run automatically on every push via CI/CD.

🤝 Contributing

Fork the repository
Create a new branch feature/my-feature
Open a pull request
Please ensure your pull request is tested for all combinations of functions; otherwise, it may be rejected.

We welcome bug reports, suggestions, and contributions of new models.

📄 License

MIT License — see LICENSE for details.

🔍 Author

Alberto Valdés

ML/AI Engineer | MLOps Engineer | Open Source Contributor

GitHub: @anvaldes

Owner

Name: Alberto Valdés
Login: anvaldes
Kind: user

Repositories: 1
Profile: https://github.com/anvaldes

JOSS Publication

sklearn-migrator: Cross-version migration of scikit-learn models for reproducible MLOps

Published

May 19, 2026

DOI

10.21105/joss.10374

Volume 11, Issue 121, Page 10374

Authors

Alberto Andres Valdes Gonzalez

Independent Researcher (Chile)

Editor

Mehmet Hakan Satman

Citation (CITATION.cff)

cff-version: 1.2.0

title: "sklearn-migrator: Cross-version migration of scikit-learn models"

message: "If you use this software, please cite it as below."

type: software

authors:
  - family-names: Valdes Gonzalez
    given-names: Alberto Andres
    orcid: "https://orcid.org/0009-0000-0752-8519"

repository-code: "https://github.com/anvaldes/sklearn-migrator"

url: "https://doi.org/10.5281/zenodo.17917931"

license: MIT

version: 0.21.1

date-released: 2025-12-12

doi: 10.5281/zenodo.17917931

preferred-citation:
  type: software
  title: "sklearn-migrator: Cross-version migration of scikit-learn models"
  authors:
    - family-names: Valdes Gonzalez
      given-names: Alberto Andres
      orcid: "https://orcid.org/0009-0000-0752-8519"
  version: 0.21.1
  year: 2025
  doi: 10.5281/zenodo.17917931
  publisher: Zenodo

GitHub Events

Total

Release event: 1
Delete event: 36
Member event: 1
Pull request event: 127
Issues event: 2
Watch event: 10
Issue comment event: 5
Push event: 92
Pull request review event: 54
Create event: 33

Last Year

Release event: 1
Delete event: 36
Member event: 1
Pull request event: 127
Issues event: 2
Watch event: 10
Issue comment event: 5
Push event: 92
Pull request review event: 54
Create event: 33

Committers

Last synced: 2 months ago

All Time

Total Commits: 117
Total Committers: 3
Avg Commits per committer: 39.0
Development Distribution Score (DDS): 0.359

Past Year

Commits: 117
Committers: 3
Avg Commits per committer: 39.0
Development Distribution Score (DDS): 0.359

Top Committers

Name	Email	Commits
Alberto Valdes	b**s@M**l	75
Alberto	a**s@u**l	40
github-actions	g**s@g**m	2
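The reported DDS value is consistent with a common definition of the score: one minus the share of commits made by the most active committer. A short check under that assumption, using the commit counts from the table:

```python
# Commit counts per committer, copied from the Top Committers table.
commits = {"Alberto Valdes": 75, "Alberto": 40, "github-actions": 2}

total = sum(commits.values())             # 117
average = total / len(commits)            # 39.0
dds = 1 - max(commits.values()) / total   # 1 - 75/117

print(total, average, round(dds, 3))  # prints: 117 39.0 0.359
```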

Committer Domains (Top 20 + Academic)

github.com: 1
uc.cl: 1

Issues and Pull Requests

Last synced: about 2 months ago

All Time

Total issues: 1
Total pull requests: 71
Average time to close issues: N/A
Average time to close pull requests: 7 minutes
Total issue authors: 1
Total pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 62
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 71
Average time to close issues: N/A
Average time to close pull requests: 7 minutes
Issue authors: 1
Pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 62
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

olacognite (1)

Pull Request Authors

anvaldes (70)
jbytecode (1)


Total packages: 1
Total downloads:
- pypi 324 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 31
Total maintainers: 1

pypi.org: sklearn-migrator

A utility to migrate scikit-learn models between versions.

Homepage: https://github.com/anvaldes/sklearn-migrator
Documentation: https://github.com/anvaldes/sklearn-migrator#readme
License: MIT License
Latest release: 0.22.0
published 2 months ago

Versions: 31
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 324 Last month

Rankings

Dependent packages count: 8.8%

Average: 29.3%

Dependent repos count: 49.8%

Maintainers (1)

anvaldes

Last synced: 2 months ago

Science Score: 92.0%
