sklearn-migrator: Cross-version migration of scikit-learn models for reproducible MLOps
sklearn-migrator: Cross-version migration of scikit-learn models for reproducible MLOps - Published in JOSS (2026)
Science Score: 92.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 4 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
Basic Info
- Host: GitHub
- Owner: anvaldes
- License: mit
- Language: Python
- Default Branch: dev
- Size: 983 KB
Statistics
- Stars: 11
- Watchers: 2
- Forks: 2
- Open Issues: 8
- Releases: 2
Metadata Files
README.md
sklearn-migrator 🧪
A Python library to serialize and migrate scikit-learn models across incompatible versions.
🚀 Motivation
Machine learning teams frequently store trained scikit-learn models using pickle or joblib.
However:
❌ These serialized models break when scikit-learn versions change
- Internal attributes change
- APIs evolve (e.g., `affinity` → `metric`)
- Tree and boosting internals get reorganized
- New default parameters appear
❌ This creates real problems:
- Production services fail after dependency upgrades
- Research becomes non-reproducible
- Long-term model governance becomes impossible
- Models can't be migrated or audited reliably
✅ What sklearn-migrator provides
✔ Serialize any supported model into a JSON-compatible dictionary
✔ Deserialize and reconstruct the model in a different scikit-learn version
✔ Remove dependency on pickle/joblib for long-term storage
✔ Enable reproducible ML pipelines across environments
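The serialize/deserialize idea can be illustrated without the library. The sketch below is a toy, not sklearn-migrator's actual format: `ToyLinearModel`, `to_dict`, and `from_dict` are hypothetical names invented for illustration. It only shows why a dictionary of plain lists, floats, and strings survives a JSON round trip independently of any class layout.

```python
import json


class ToyLinearModel:
    """Hypothetical stand-in for a fitted estimator (not a scikit-learn class)."""

    def __init__(self, coef, intercept):
        self.coef = list(coef)
        self.intercept = float(intercept)

    def predict(self, rows):
        # Dot product of each row with the coefficients, plus the intercept.
        return [sum(c * x for c, x in zip(self.coef, row)) + self.intercept
                for row in rows]


def to_dict(model, version):
    # Only JSON-native types (lists, floats, strings) go into the payload.
    return {"version_in": version, "coef": model.coef, "intercept": model.intercept}


def from_dict(payload):
    # Rebuild the object from plain data; no pickled class layout is involved.
    return ToyLinearModel(payload["coef"], payload["intercept"])


original = ToyLinearModel([0.5, -1.0], 2.0)
text = json.dumps(to_dict(original, "1.5.0"))
restored = from_dict(json.loads(text))

rows = [[1.0, 2.0], [0.0, 0.0]]
same = original.predict(rows) == restored.predict(rows)
```

The real library has to capture far more state (tree structures, fitted attributes, version-specific parameters), but the storage principle is the same.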
This library has been validated across 1,024 version migration pairs (from → to), covering scikit-learn versions 0.21.3 through 1.7.2.
💡 Supported Models (21 models)
sklearn-migrator supports 21 core models across classification, regression, clustering, and dimensionality reduction.
📘 Classification
| Model | Supported |
|------------------------------|-----------|
| DecisionTreeClassifier | ✅ |
| RandomForestClassifier | ✅ |
| GradientBoostingClassifier | ✅ |
| LogisticRegression | ✅ |
| KNeighborsClassifier | ✅ |
| SVC (Support Vector Classifier) | ✅ |
| MLPClassifier | ✅ |
📗 Regression
| Model | Supported |
|------------------------------|-----------|
| DecisionTreeRegressor | ✅ |
| RandomForestRegressor | ✅ |
| GradientBoostingRegressor | ✅ |
| LinearRegression | ✅ |
| Ridge | ✅ |
| Lasso | ✅ |
| KNeighborsRegressor | ✅ |
| SVR (Support Vector Regressor) | ✅ |
| AdaBoostRegressor | ✅ |
| MLPRegressor | ✅ |
📙 Clustering
| Model | Supported |
|----------------------|-----------|
| KMeans | ✅ |
| MiniBatchKMeans | ✅ |
| Agglomerative | ✅ |
📘 Dimensionality Reduction
| Model | Supported |
|-------|-----------|
| PCA | ✅ |
🔢 Version Compatibility Matrix
The library supports model migrations across the full matrix:
- 32 versions
- 1,024 migration pairs
- Fully tested using automated environments via CI/CD on every push
```python
versions = [
    '0.21.3', '0.22.0', '0.22.1', '0.23.0', '0.23.1', '0.23.2',
    '0.24.0', '0.24.1', '0.24.2', '1.0.0', '1.0.1', '1.0.2',
    '1.1.0', '1.1.1', '1.1.2', '1.1.3', '1.2.0', '1.2.1', '1.2.2',
    '1.3.0', '1.3.1', '1.3.2', '1.4.0', '1.4.2', '1.5.0', '1.5.1',
    '1.5.2', '1.6.0', '1.6.1', '1.7.0', '1.7.1', '1.7.2'
]
```
| From \ To | 0.21.3 | 0.22.0 | ... | 1.7.2 |
| --------- | ------ | ------ | --- | ----- |
| 0.21.3 | ✅ | ✅ | ... | ✅ |
| 0.22.0 | ✅ | ✅ | ... | ✅ |
| ... | ... | ... | ... | ... |
| 1.7.2 | ✅ | ✅ | ... | ✅ |
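The 1,024 figure follows directly from pairing each of the 32 versions with every version, including itself. A minimal sketch of that enumeration, using only the standard library (the variable names `pairs` and `n_pairs` are illustrative, not part of the library):

```python
from itertools import product

versions = [
    '0.21.3', '0.22.0', '0.22.1', '0.23.0', '0.23.1', '0.23.2',
    '0.24.0', '0.24.1', '0.24.2', '1.0.0', '1.0.1', '1.0.2',
    '1.1.0', '1.1.1', '1.1.2', '1.1.3', '1.2.0', '1.2.1', '1.2.2',
    '1.3.0', '1.3.1', '1.3.2', '1.4.0', '1.4.2', '1.5.0', '1.5.1',
    '1.5.2', '1.6.0', '1.6.1', '1.7.0', '1.7.1', '1.7.2'
]

# Every ordered (from, to) combination, including same-version pairs.
pairs = list(product(versions, repeat=2))
n_pairs = len(pairs)
```

Each element of `pairs` corresponds to one cell of the matrix above.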
📊 Validation Metric
Each migration pair (version_in, version_out) is validated using:
$$\max |y_{\text{in}} - y_{\text{out}}| < 10^{-2}$$

Where:
- $y_{\text{in}}$: predictions from the model in the source version
- $y_{\text{out}}$: predictions from the migrated model in the target version
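For a single migration pair, the criterion above amounts to an element-wise comparison of two prediction vectors. The following is a minimal pure-Python sketch; `max_abs_diff` and `migration_ok` are hypothetical helper names, and the sample values are made up for illustration.

```python
def max_abs_diff(y_in, y_out):
    """Largest absolute element-wise difference between two prediction lists."""
    if len(y_in) != len(y_out):
        raise ValueError("prediction lists must have the same length")
    return max(abs(a - b) for a, b in zip(y_in, y_out))


def migration_ok(y_in, y_out, tol=1e-2):
    # Mirrors the criterion above: the maximum absolute difference must be below 10^-2.
    return max_abs_diff(y_in, y_out) < tol


# Illustrative values only; real checks would use predictions from both environments.
y_in = [0.10, 0.25, 0.90]
y_out = [0.10, 0.251, 0.895]
ok = migration_ok(y_in, y_out)
```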
The worst case across all 1,024 pairs is obtained via:
```python
df_performance.abs().max().max()  # global worst case (32x32 matrix)
```
⚠️ All 1,024 combinations and 21 models are automatically tested on every push via CI/CD, using isolated Docker environments for each sklearn version. Each model is validated under a representative parameter configuration; exhaustive combinatorial testing of all parameter combinations is outside the current scope.
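The global worst case is simply the largest absolute entry of the from/to difference matrix. A dependency-free sketch of the same reduction is shown below; the nested-dictionary layout and the function name `global_worst_case` are assumptions for illustration, and the numbers are invented rather than taken from the project's results.

```python
def global_worst_case(performance):
    """Largest absolute value in a nested {version_in: {version_out: diff}} mapping."""
    return max(abs(diff) for row in performance.values() for diff in row.values())


# Hypothetical 2x2 excerpt; the values are made up for illustration only.
performance = {
    "1.5.0": {"1.5.0": 0.0, "1.7.0": -0.003},
    "1.7.0": {"1.5.0": 0.004, "1.7.0": 0.0},
}
worst = global_worst_case(performance)
```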
📂 Installation
```bash
pip install sklearn-migrator
```
📚 API Documentation
For full API documentation covering all 21 models, function signatures, parameters, return types, and usage examples, see API.md.
💥 Use Cases
- Long-term model storage: Store models in a future-proof format across teams and systems.
- Production model migration: Move models safely across major `scikit-learn` upgrades.
- Auditing and inspection: Read serialized models as JSON, inspect structure, hyperparameters, and internals.
- Cross-platform inference: Serialize in Python, serve elsewhere (e.g., microservices).
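For the auditing use case, a serialized model can be inspected like any other JSON document. The sketch below walks a parsed payload and lists keys with their value types. The helper `summarize` and the example payload are hypothetical; the real key names depend on the model and are documented in API.md.

```python
import json


def summarize(obj, depth=0, max_depth=2):
    """Return indented lines describing keys and value types of parsed JSON."""
    lines = []
    if isinstance(obj, dict) and depth < max_depth:
        for key, value in obj.items():
            lines.append("  " * depth + f"{key}: {type(value).__name__}")
            lines.extend(summarize(value, depth + 1, max_depth))
    return lines


# Hypothetical payload; real key names depend on the model and library version.
payload = json.loads('{"params": {"n_estimators": 100}, "trees": [1, 2, 3]}')
report = summarize(payload)
```

Printing `"\n".join(report)` gives a quick outline of a stored model without loading scikit-learn at all.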
1. Using two Python environments
You can serialize the model in an environment with one scikit-learn version (for example, 1.5.0) and then deserialize it in another environment with a different version (for example, 1.7.0).
The deserialized model belongs to the scikit-learn version of the environment where it was deserialized, in this case 1.7.0.
It is important to understand what version of scikit-learn you want to migrate from, and what version you want to migrate to, in order to create the appropriate environments for serialization and deserialization.
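Before building the two environments, it can help to confirm that both versions are in the validated list. The following is a sketch under stated assumptions: `check_pair` and `installed_sklearn_version` are hypothetical helpers, not part of sklearn-migrator, and `supported` here is only a short excerpt of the full version list.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_sklearn_version():
    """Return the installed scikit-learn version string, or None if absent."""
    try:
        return version("scikit-learn")
    except PackageNotFoundError:
        return None


def check_pair(version_in, version_out, supported):
    """Raise ValueError if either version is missing from the supported set."""
    unsupported = [v for v in (version_in, version_out) if v not in supported]
    if unsupported:
        raise ValueError(f"unsupported scikit-learn version(s): {unsupported}")
    return (version_in, version_out)


# Excerpt of the validated versions; 1.4.1 is intentionally absent from the list.
supported = {"1.4.0", "1.4.2", "1.5.0", "1.7.0"}
pair = check_pair("1.5.0", "1.7.0", supported)
current = installed_sklearn_version()  # None when scikit-learn is not installed
```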
a. Serialize the model
```python
import json
import sklearn
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn_migrator.regression.random_forest_reg import serialize_random_forest_reg

X, y = make_regression(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train in the source environment and keep its predictions for later comparison
model = RandomForestRegressor().fit(X_train, y_train)
predictions = model.predict(X_test)

# Convert the fitted model into a JSON-compatible dictionary
data = serialize_random_forest_reg(model, sklearn.__version__)

with open("model.json", "w") as f:
    json.dump(data, f)
```
b. Deserialize the model
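```python
import json
import sklearn
from sklearn_migrator.regression.random_forest_reg import deserialize_random_forest_reg

with open("model.json") as f:
    data = json.load(f)

# Rebuild the model under the scikit-learn version installed in this environment
new_model = deserialize_random_forest_reg(data, sklearn.__version__)

# X_test must be the same held-out data used in the source environment
new_predictions = new_model.predict(X_test)
```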
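Because the two scripts run in separate environments, the evaluation rows and the source-version predictions have to travel with the model for any comparison to be possible. One way to do this, sketched here with hypothetical helpers (`save_reference`, `load_reference`) and a hypothetical file name (`reference.json`), is to store them as plain JSON next to `model.json`.

```python
import json
import os
import tempfile


def save_reference(path, inputs, predictions):
    """Store evaluation rows and source-version predictions as plain JSON."""
    with open(path, "w") as f:
        json.dump({"inputs": inputs, "predictions": predictions}, f)


def load_reference(path):
    """Read back the rows and predictions written by save_reference."""
    with open(path) as f:
        ref = json.load(f)
    return ref["inputs"], ref["predictions"]


# Source environment: with NumPy arrays, convert first, e.g.
# save_reference("reference.json", X_test.tolist(), predictions.tolist())
path = os.path.join(tempfile.mkdtemp(), "reference.json")
save_reference(path, [[0.5, -1.2], [0.3, 1.1]], [10.0, -4.5])

# Target environment: reload the rows, predict with the migrated model, compare.
inputs, expected = load_reference(path)
```

The sample rows and predictions above are placeholders, not output from a real model.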
2. Docker: Step by Step
Suppose you have a Random Forest classifier saved as model.pkl, trained with scikit-learn 1.5.0, and you want to migrate it to 1.7.0.
i. Create the following folder on your Desktop:
```bash
/test_github
```
Then copy your model.pkl into this folder.
ii. The Dockerfiles and requirements for all supported input versions are available in the integration/environments/input/ directory of this repository. Copy the files for your input version (e.g., 1.5.0):
```bash
/test_github/input/1.5.0/Dockerfile_input
/test_github/input/1.5.0/requirements_input.txt
```
iii. The Dockerfiles and requirements for all supported output versions are available in the integration/environments/output/ directory of this repository. Copy the files for your output version (e.g., 1.7.0):
```bash
/test_github/output/1.7.0/Dockerfile_output
/test_github/output/1.7.0/requirements_output.txt
```
iv. Create input.py:
```python
import json
import joblib
import sklearn
import numpy as np
import pandas as pd
from joblib import load

from sklearn.ensemble import RandomForestClassifier

from sklearn_migrator.classification.random_forest_clf import serialize_random_forest_clf

version_sklearn_in = sklearn.__version__

model = load('model.pkl')

all_data = serialize_random_forest_clf(model, version_sklearn_in)


def convert(o):
    # Fallback for json.dump: turn NumPy scalars and arrays into native Python types
    if isinstance(o, (np.integer, np.int64)):
        return int(o)
    elif isinstance(o, (np.floating, np.float64)):
        return float(o)
    elif isinstance(o, np.ndarray):
        return o.tolist()
    else:
        raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")


with open("input_model/all_data.json", "w") as f:
    json.dump(all_data, f, default=convert)

# Reference row used to compare predictions before and after migration
fake_row = np.array([[0.5, -1.2, 0.3, 1.1, -0.7, 0.9, 0.0, -0.3, 1.5, 0.2]])

y_pred = pd.DataFrame(model.predict_proba(fake_row))
y_pred.to_csv('input_model/y_pred.csv', index=False)
```
v. Create output.py:
```python
import json
import joblib
import sklearn
import numpy as np
import pandas as pd
from joblib import load

from sklearn.ensemble import RandomForestClassifier

from sklearn_migrator.classification.random_forest_clf import deserialize_random_forest_clf

version_sklearn_out = sklearn.__version__

with open("input_model/all_data.json", "r") as f:
    all_data = json.load(f)

# Rebuild the classifier under the target scikit-learn version
new_model = deserialize_random_forest_clf(all_data, version_sklearn_out)

joblib.dump(new_model, 'output_model/new_model.pkl')

# Same reference row as in input.py, so both prediction files are comparable
fake_row = np.array([[0.5, -1.2, 0.3, 1.1, -0.7, 0.9, 0.0, -0.3, 1.5, 0.2]])

y_pred_new = pd.DataFrame(new_model.predict_proba(fake_row))
y_pred_new.to_csv('output_model/y_pred_new.csv', index=False)
```
vi. Copy the Dockerfiles and requirements files into the root of test_github/:
```bash
cp input/1.5.0/* output/1.7.0/* .
```
vii. Create two folders: input_model/ and output_model/.
viii. Run the following commands from the root of the test_github/ folder:
```bash
docker build -f Dockerfile_input -t imageinput1.5.0 .
docker build -f Dockerfile_output -t imageoutput1.7.0 .

docker run --rm \
  -v "$(pwd)/input_model:/app/input_model" \
  -v "$(pwd)/model.pkl:/app/model.pkl" \
  imageinput1.5.0

docker run --rm \
  -v "$(pwd)/input_model:/app/input_model" \
  -v "$(pwd)/output_model:/app/output_model" \
  imageoutput1.7.0
```
ix. The migrated model is now in the output_model/ folder as new_model.pkl. It is a scikit-learn model for version 1.7.0.
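Once both containers have run, the two prediction files written by input.py and output.py can be compared to confirm the migration. The sketch below uses only the standard library. The helper names `read_matrix` and `csv_max_abs_diff` are hypothetical, and the demo writes temporary stand-in files with made-up probabilities instead of reading the real outputs.

```python
import csv
import os
import tempfile


def read_matrix(path):
    """Read a numeric CSV with one header row into a list of float rows."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return [[float(cell) for cell in row] for row in rows[1:]]


def csv_max_abs_diff(path_in, path_out):
    """Largest absolute cell-wise difference between two prediction CSVs."""
    a, b = read_matrix(path_in), read_matrix(path_out)
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Demo with temporary files standing in for the two prediction CSVs.
tmp = tempfile.mkdtemp()
p_in = os.path.join(tmp, "y_pred.csv")
p_out = os.path.join(tmp, "y_pred_new.csv")
for p, row in ((p_in, "0.8,0.2"), (p_out, "0.8,0.2")):
    with open(p, "w") as f:
        f.write("0,1\n" + row + "\n")  # header row, then one row of probabilities
diff = csv_max_abs_diff(p_in, p_out)
```

In the real workflow the two paths would point at the prediction files inside input_model/ and output_model/, and a value of `diff` below the validation tolerance indicates that the migrated model reproduces the original predictions.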
🔧 Development
Run tests locally
```bash
pytest tests/
```
Integration tests run automatically on every push via CI/CD.
🤝 Contributing
- Fork the repository
- Create a new branch `feature/my-feature`
- Open a pull request
- Please ensure your pull request is tested for all combinations of functions; otherwise, it may be rejected.
We welcome bug reports, suggestions, and contributions of new models.
📄 License
MIT License — see LICENSE for details.
🔍 Author
Alberto Valdés
ML/AI Engineer | MLOps Engineer | Open Source Contributor
GitHub: @anvaldes
Owner
- Name: Alberto Valdés
- Login: anvaldes
- Kind: user
- Repositories: 1
- Profile: https://github.com/anvaldes
JOSS Publication
sklearn-migrator: Cross-version migration of scikit-learn models for reproducible MLOps
Tags
machine learning, scikit-learn, MLOps, model reproducibility, model migration, model persistence
Citation (CITATION.cff)
cff-version: 1.2.0
title: "sklearn-migrator: Cross-version migration of scikit-learn models"
message: "If you use this software, please cite it as below."
type: software
authors:
  - family-names: Valdes Gonzalez
    given-names: Alberto Andres
    orcid: "https://orcid.org/0009-0000-0752-8519"
repository-code: "https://github.com/anvaldes/sklearn-migrator"
url: "https://doi.org/10.5281/zenodo.17917931"
license: MIT
version: 0.21.1
date-released: 2025-12-12
doi: 10.5281/zenodo.17917931
preferred-citation:
  type: software
  title: "sklearn-migrator: Cross-version migration of scikit-learn models"
  authors:
    - family-names: Valdes Gonzalez
      given-names: Alberto Andres
      orcid: "https://orcid.org/0009-0000-0752-8519"
  version: 0.21.1
  year: 2025
  doi: 10.5281/zenodo.17917931
  publisher: Zenodo
GitHub Events
Total
- Release event: 1
- Delete event: 36
- Member event: 1
- Pull request event: 127
- Issues event: 2
- Watch event: 10
- Issue comment event: 5
- Push event: 92
- Pull request review event: 54
- Create event: 33
Last Year
- Release event: 1
- Delete event: 36
- Member event: 1
- Pull request event: 127
- Issues event: 2
- Watch event: 10
- Issue comment event: 5
- Push event: 92
- Pull request review event: 54
- Create event: 33
Committers
Last synced: 14 days ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Alberto Valdes | b****s@M****l | 75 |
| Alberto | a****s@u****l | 40 |
| github-actions | g****s@g****m | 2 |
Issues and Pull Requests
Last synced: 9 days ago
All Time
- Total issues: 1
- Total pull requests: 71
- Average time to close issues: N/A
- Average time to close pull requests: 7 minutes
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 62
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 71
- Average time to close issues: N/A
- Average time to close pull requests: 7 minutes
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 62
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- olacognite (1)
Pull Request Authors
- anvaldes (70)
- jbytecode (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 324 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 31
- Total maintainers: 1
pypi.org: sklearn-migrator
A utility to migrate scikit-learn models between versions.
- Homepage: https://github.com/anvaldes/sklearn-migrator
- Documentation: https://github.com/anvaldes/sklearn-migrator#readme
- License: MIT License
- Latest release: 0.22.0 (published 20 days ago)
Dependencies
- scikit-learn >=0.21.3
