Recent Releases of kedro
kedro - 1.0.0
Major features and improvements
Data Catalog
- The previously experimental `KedroDataCatalog` has been renamed to `DataCatalog` and is now the default catalog implementation.
- It retains the dict-like interface, supports lazy dataset initialisation, and delivers improved performance.
- While this change is seamless for users following standard Kedro workflows, it introduces a richer API for programmatic use:
  - New pipeline-aware commands, available via both the CLI and interactive environments.
  - Simplified handling of dataset factories.
  - Centralised pattern resolution via the `CatalogConfigResolver` property.
  - Ability to serialise the catalog to configuration and reconstruct it from that configuration.
Read more in the Kedro documentation.
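To make the dict-like interface and lazy initialisation concrete, here is a minimal standalone sketch. It does not use Kedro's code: `LazyCatalog` and `build_dataset` are hypothetical names, and the toy `to_config()` only mirrors the idea of round-tripping through configuration. The real `DataCatalog` may differ in every detail.

```python
from collections.abc import Mapping


def build_dataset(config):
    """Hypothetical factory standing in for real dataset construction."""
    return {"type": config["type"], "filepath": config.get("filepath")}


class LazyCatalog(Mapping):
    """Toy dict-like catalog that builds datasets only on first access."""

    def __init__(self, config):
        self._config = dict(config)  # raw configuration, kept untouched
        self._datasets = {}          # datasets built so far

    def __getitem__(self, name):
        if name not in self._config:
            raise KeyError(name)
        if name not in self._datasets:
            # Lazy step: build the dataset the first time it is requested.
            self._datasets[name] = build_dataset(self._config[name])
        return self._datasets[name]

    def __contains__(self, name):
        # Membership checks must not trigger construction.
        return name in self._config

    def __iter__(self):
        return iter(self._config)

    def __len__(self):
        return len(self._config)

    def to_config(self):
        """Serialise back to plain configuration (illustrative only)."""
        return dict(self._config)


catalog = LazyCatalog({"cars": {"type": "pandas.CSVDataset", "filepath": "cars.csv"}})
built_before_access = len(catalog._datasets)  # nothing has been built yet
dataset = catalog["cars"]                     # first access triggers construction
rebuilt = LazyCatalog(catalog.to_config())    # round trip through configuration
```

Keeping the raw configuration separate from the built objects is what makes both the lazy construction and the round trip cheap in this sketch.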
Namespaces
- Added support for running multiple namespaces within a single session with the `--namespaces` CLI option and the `namespaces` argument in the `KedroSession.run()` method.
- Improved namespace validation efficiency to prevent significant slowdowns when creating large pipelines.
- Added stricter validation to dataset names in the `Node` class, ensuring `.` characters are reserved to be used as part of a namespace.
- Added a `prefix_datasets_with_namespace` argument to the `Pipeline` class which allows users to turn on or off the prefixing of the namespace to the node inputs, outputs, and parameters.
- Changed pipeline filtering for namespace to return exact namespace matches instead of partial matches.
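The difference between exact and partial namespace matching can be illustrated with a small standalone sketch. The release note does not spell out the matching rule, so the code below makes an assumption: a node is selected when its namespace equals the requested one or is nested beneath it (separated by `.`), while a mere shared prefix is not enough. The function names are hypothetical and not part of Kedro.

```python
def matches_namespace(node_namespace, requested):
    """Return True if a node's namespace is selected by `requested`.

    Assumption: an exact match selects the node, and so does a nested
    namespace such as "data.raw" when "data" is requested. A shared
    prefix alone ("data_science" vs "data") does not.
    """
    if node_namespace is None:
        return False
    return node_namespace == requested or node_namespace.startswith(requested + ".")


def partial_match(node_namespace, requested):
    """Looser prefix rule, shown only for contrast."""
    return node_namespace is not None and node_namespace.startswith(requested)


namespaces = ["data", "data.raw", "data_science", None]
exact = [ns for ns in namespaces if matches_namespace(ns, "data")]
loose = [ns for ns in namespaces if partial_match(ns, "data")]
```

Under these assumptions, `data_science` is picked up only by the looser prefix rule, which is the kind of accidental match the change is meant to avoid.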
Other features and improvements
- Changed the default node name to be formed of the function name used in the node suffixed by a secure hash (SHA-256) based on the function, inputs, and outputs, ensuring uniqueness and improved readability.
- Added an option to select which multiprocessing start method is going to be used on `ParallelRunner` via the `KEDRO_MP_CONTEXT` environment variable.
- Added `--only-missing-outputs` CLI flag to `kedro run`. This flag skips nodes when all their persistent outputs exist.
- Updated `kedro registry describe` to return the node name property instead of creating its own name for the node.
- Removed `pre-commit-hooks` dependency for new project creation.
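As a rough illustration of the new default node naming, the sketch below derives a name from the function name plus a SHA-256 digest of the function, inputs, and outputs. The `__` separator, the 8-character truncation, and the exact fields fed into the hash are assumptions made for this example; Kedro's actual format may differ.

```python
import hashlib


def default_node_name(func, inputs, outputs, hash_length=8):
    """Build a name like 'clean__1a2b3c4d' from a function and its datasets.

    Assumptions: the '__' separator, the truncation length, and the exact
    fields fed into the hash are illustrative choices, not Kedro's.
    """
    payload = "|".join([func.__qualname__, ",".join(inputs), ",".join(outputs)])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{func.__name__}__{digest[:hash_length]}"


def clean(raw):
    return raw


# The same function wired to different datasets yields different names.
name_a = default_node_name(clean, ["raw_cars"], ["clean_cars"])
name_b = default_node_name(clean, ["raw_bikes"], ["clean_bikes"])
```

Because the digest depends on the inputs and outputs, reusing one function in several nodes still produces distinct, stable names.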
Breaking changes to the API
CLI
- `kedro catalog create` command has been removed.
- `kedro catalog list`, `kedro catalog rank`, and `kedro catalog resolve` commands have been replaced with `kedro catalog describe-datasets`, `kedro catalog list-patterns` and `kedro catalog resolve-patterns` commands, respectively.
- The `kedro run` option `--namespace` has been removed and replaced with `--namespaces`.
- The `kedro micropkg` CLI command has been removed as part of the micro-packaging feature deprecation.
API
- Private methods `_is_project` and `_find_kedro_project` are changed to `is_kedro_project` and `find_kedro_project`.
- Renamed instances of `extra_params` and `_extra_params` to `runtime_params`.
- Removed the `modular_pipeline` module and moved functionality to the `pipeline` module instead.
- Renamed `ModularPipelineError` to `PipelineError`.
- `Pipeline.grouped_nodes_by_namespace()` was replaced with `group_nodes_by(group_by)`, which supports multiple strategies and returns a list of `GroupedNodes`, improving type safety and consistency for deployment plugin integrations.
- Renamed `session_id` parameter to `run_id` in all runner methods and hooks to improve API clarity and prepare for future multi-run session support.
- Removed the following `DataCatalog` methods: `_get_dataset()`, `add_all()`, `add_feed_dict()`, `list()`, and `shallow_copy()`.
- Changed the output of `runner.run()` and `session.run()`: it now always returns all pipeline outputs, regardless of catalog configuration.
- Removed the `AbstractRunner.run_only_missing()` method, an older and underused API for partial runs. Please use the `--only-missing-outputs` CLI flag instead.
Documentation changes
- Revamped the look and feel of the Kedro documentation, including a new theme and improved navigation with `mkdocs` as the documentation engine.
- Updated the `DataCatalog` documentation with improved structure and detailed description of new features. Read the DataCatalog documentation here.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- Yury Fedotov
- Kitsios Konstantinos
Migration guide from Kedro 0.19.* to 1.*
See the migration guide for 1.0.0 in the Kedro documentation.
- Python
Published by merelcht 10 months ago
kedro - 1.0.0rc3
Major features and improvements
Changed DataCatalog.__getitem__ to raise DatasetNotFoundError for missing datasets, aligning with expected dictionary behavior.
Bug fixes and other changes
Breaking changes to the API
Upcoming deprecations for Kedro 1.0.0
Documentation changes
Community contributions
- Python
Published by merelcht 11 months ago
kedro - 1.0.0rc2
Major features and improvements
- Added `--only-missing-outputs` CLI flag to `kedro run`. This flag skips nodes when all their persistent outputs exist.
- Removed the `AbstractRunner.run_only_missing()` method, an older and underused API for partial runs. Please use the `--only-missing-outputs` CLI flag instead.
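The skipping rule behind `--only-missing-outputs` can be sketched in a few lines. This is a simplified standalone model, not Kedro's implementation: `nodes_to_run` is a hypothetical helper, it ignores dependencies between nodes, and it assumes that a node with no persistent outputs always runs.

```python
def nodes_to_run(nodes, persistent, exists):
    """Select the nodes that still need to run.

    nodes: list of (name, outputs) pairs.
    persistent: set of dataset names that are saved to storage.
    exists: callable returning True if a dataset is already stored.
    A node is skipped only if it has persistent outputs and all of them exist.
    """
    selected = []
    for name, outputs in nodes:
        saved = [output for output in outputs if output in persistent]
        if saved and all(exists(output) for output in saved):
            continue  # every persistent output is already there
        selected.append(name)
    return selected


stored = {"clean_cars"}  # datasets already present on disk in this example
pending = nodes_to_run(
    nodes=[
        ("preprocess", ["clean_cars"]),
        ("train", ["model"]),
        ("report", ["in_memory_summary"]),
    ],
    persistent={"clean_cars", "model"},
    exists=lambda dataset_name: dataset_name in stored,
)
```

Here only `preprocess` is skipped: its single persistent output already exists, while `train` is missing its output and `report` produces nothing persistent.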
Bug fixes and other changes
- Improved namespace validation efficiency to prevent significant slowdowns when creating large pipelines
Breaking changes to the API
Upcoming deprecations for Kedro 1.0.0
Documentation changes
Community contributions
- Python
Published by merelcht 11 months ago
kedro - 1.0.0rc1
Major features and improvements
- Added stricter validation to dataset names in the `Node` class, ensuring `.` characters are reserved to be used as part of a namespace.
- Added a `prefix_datasets_with_namespace` argument to the `Pipeline` class which allows users to turn on or off the prefixing of the namespace to the node inputs, outputs, and parameters.
- Changed the default node name to be formed of the function name used in the node suffixed by a secure hash (SHA-256) based on the function, inputs, and outputs, ensuring uniqueness and improved readability.
- Added an option to select which multiprocessing start method is going to be used on `ParallelRunner` via the `KEDRO_MP_CONTEXT` environment variable.
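Selecting a multiprocessing start method from an environment variable might look like the sketch below, which uses only the standard library. The helper name `resolve_mp_context`, the fallback to the platform default, and the `ValueError` on unsupported values are assumptions for illustration; the release note only states that `KEDRO_MP_CONTEXT` controls the start method.

```python
import multiprocessing
import os


def resolve_mp_context(env=None):
    """Return a multiprocessing context chosen via KEDRO_MP_CONTEXT.

    Assumptions: an unset variable falls back to the platform default,
    and an unsupported value raises ValueError.
    """
    env = os.environ if env is None else env
    method = env.get("KEDRO_MP_CONTEXT")
    if method is None:
        return multiprocessing.get_context()
    if method not in multiprocessing.get_all_start_methods():
        raise ValueError(f"Unsupported start method: {method!r}")
    return multiprocessing.get_context(method)


# "spawn" is available on every platform supported by multiprocessing.
context = resolve_mp_context({"KEDRO_MP_CONTEXT": "spawn"})
```

Passing the environment as a mapping keeps the helper easy to test without mutating `os.environ`.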
Bug fixes and other changes
- Changed pipeline filtering for namespace to return exact namespace matches instead of partial matches.
- Added support for running multiple namespaces within a single session.
- Updated `kedro registry describe` to return the node name property instead of creating its own name for the node.
Documentation changes
- Updated the `DataCatalog` documentation with improved structure and detailed description of new features.
Community contributions
Breaking changes to the API
- Private methods `_is_project` and `_find_kedro_project` are changed to `is_kedro_project` and `find_kedro_project`.
- Renamed instances of `extra_params` and `_extra_params` to `runtime_params`.
- Removed the `modular_pipeline` module and moved functionality to the `pipeline` module instead.
- Renamed `ModularPipelineError` to `PipelineError`.
- `Pipeline.grouped_nodes_by_namespace()` was replaced with `group_nodes_by(group_by)`, which supports multiple strategies and returns a list of `GroupedNodes`, improving type safety and consistency for deployment plugin integrations.
- The micro-packaging feature and the corresponding `micropkg` CLI command have been removed.
- Renamed `session_id` parameter to `run_id` in all runner methods and hooks to improve API clarity and prepare for future multi-run session support.
- Removed the following `DataCatalog` methods: `_get_dataset()`, `add_all()`, `add_feed_dict()`, `list()`, and `shallow_copy()`.
- Removed the CLI command `kedro catalog create`.
- Changed the output of `runner.run()`: it now always returns all pipeline outputs, regardless of catalog configuration.
Migration guide from Kedro 0.19.* to 1.*
See the migration guide for 1.0.0 in the Kedro documentation.
- Python
Published by merelcht 12 months ago
kedro - 0.19.14
Major features and improvements
- Added execution time to pipeline completion log.
Bug fixes and other changes
- Fixed a recursion error in custom datasets when `_describe()` accessed `self.__dict__`.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- Yury Fedotov
- Python
Published by merelcht 12 months ago
kedro - 0.19.13
Major features and improvements
- Unified `pipeline()` and `Pipeline` into a single module (`kedro.pipeline`), aligning with the `node()`/`Node` design pattern and improving namespace handling.
Bug fixes and other changes
- Fixed bug where project creation workflow would use the `main` branch version of `kedro-starters` instead of the respective release version.
- Fixed namespacing for `confirms` during pipeline creation to support `IncrementalDataset`.
- Fixed bug where `OmegaConf` caused an error during config resolution with runtime parameters.
- Cached `inputs` in `Node` when created from dictionary for better performance.
- Enabled pluggy tracing only when logging level is set to `DEBUG` to speed up the execution of project runs.
Upcoming deprecations for Kedro 1.0.0
- Added a deprecation warning for catalog CLI commands. The `kedro catalog rank`, `kedro catalog list`, and `kedro catalog resolve` commands will be replaced with their alternatives, and the `kedro catalog create` command will be removed.
- Added a deprecation warning for `KedroDataCatalog` that will replace `DataCatalog` while adopting the original `DataCatalog` name.
- Added a deprecation warning for the `--namespace` option for `kedro run`. It will be replaced with the `--namespaces` option, which will allow for running multiple namespaces together.
- The `modular_pipeline` module is deprecated and will be removed in Kedro 1.0.0. Use the `pipeline` module instead.
Note: On March 20th, a security vulnerability, CVE-2024-12215, was identified in Kedro. This issue stems from the deprecated micropackaging functionality, which is scheduled for removal in the upcoming Kedro 1.0 release. While we agree with the CVE assigned, this vulnerability only poses a risk if you pull a malicious micropackage from an untrusted source. If you're concerned, we recommend avoiding the micropackaging feature for now and upgrading to Kedro 1.0 once it's released.
Documentation changes
- Updated Dask deployment docs.
- Added non-jupyter environment integration page (for example Marimo) with dynamic Kedro session loading.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- Arnout Verboven
- gabohc
- Luis Chaves Rodriguez
- Python
Published by merelcht about 1 year ago
kedro - 0.19.12
Major features and improvements
- Added `KedroDataCatalog.filter()` to filter datasets by name and type.
- Added `Pipeline.grouped_nodes_by_namespace` property which returns a dictionary of nodes grouped by namespace, intended to be used by plugins to facilitate deployment of namespaced nodes together.
- Added support for cloud storage protocols in `--conf-source`, allowing configuration to be loaded from remote locations such as S3.
Bug fixes and other changes
- Added `DataCatalog` deprecation warning.
- Updated `_LazyDataset` representation when printing `KedroDataCatalog`.
- Fixed `MemoryDataset` to infer `assign` copy mode for Ibis Tables, which previously would be inferred as `deepcopy`.
- Fixed pipeline packaging issue by ensuring `pipelines/__init__.py` exists when creating new pipelines.
- Changed the execution of `SequentialRunner` to not use an executor pool to ensure it's single threaded.
- Fixed `%load_node` magic command to work with Jupyter Notebook `>=7.2.0`.
- Remove `7: Kedro Viz` from Kedro tools.
- Updated node grouping API to only group on first level of namespace.
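Grouping nodes on the first level of their namespace, as described for the node grouping API, can be pictured with the standalone sketch below. The function name and the fallback for nodes without a namespace (each forms its own group, keyed by node name) are assumptions for illustration, not Kedro's behaviour.

```python
from collections import defaultdict


def group_by_top_level_namespace(nodes):
    """Group node names by the first segment of their namespace.

    nodes: iterable of (node_name, namespace_or_None) pairs.
    Assumption: nodes without a namespace form single-node groups keyed
    by their own name.
    """
    groups = defaultdict(list)
    for name, namespace in nodes:
        key = namespace.split(".")[0] if namespace else name
        groups[key].append(name)
    return dict(groups)


grouped = group_by_top_level_namespace([
    ("load", "ingest.raw"),
    ("clean", "ingest.staging"),
    ("train", "modelling"),
    ("report", None),
])
```

Nested namespaces such as `ingest.raw` and `ingest.staging` collapse into one `ingest` group, which is what a deployment plugin would schedule together under this model.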
Documentation changes
- Added documentation for Kedro's support for Delta Lake versioning.
- Added documentation for Kedro's support for Iceberg versioning.
- Added documentation for Kedro's nodes grouping in deployment.
- Fixed a minor grammatical error in Kedro-Viz installation instructions to improve documentation clarity.
- Improved the Kedro VSCode extension documentation.
- Updated the recommendations for nesting namespaces.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- Jacob Pieniazek
- Lucas Vittor
- Ean Jimenez
- Toran Sahu
- Python
Published by merelcht about 1 year ago
kedro - 0.19.11
Major features and improvements
- Implemented `KedroDataCatalog.to_config()` method that converts the catalog instance into a configuration format suitable for serialization.
- Improve OmegaConfigLoader performance.
- Replaced `trufflehog` with `detect-secrets` for detecting secrets within a code base.
- Added support for `%load_ext kedro`.
Bug fixes and other changes
- Added validation to ensure dataset versions consistency across catalog.
- Fixed a bug in project creation when using a custom starter template offline.
- Added `node` import to the pipeline template.
- Update error message when executing kedro run without pipeline.
- Safeguard hooks when user incorrectly registers a hook class in settings.py.
- Fixed parsing paths with query and fragment.
- Remove lowercase transformation in regex validation.
- Moved `kedro-catalog` JSON schema to `kedro-datasets`.
- Updated `Partitioned dataset lazy saving` docs page.
- Fixed `KedroDataCatalog` mutation after pipeline run.
- Made `KedroDataCatalog._datasets` compatible with `DataCatalog._datasets`.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- Hendrik Scherner
- Chris Schopp
- Python
Published by merelcht over 1 year ago
kedro - 0.19.10
Major features and improvements
- Add official support for Python 3.13.
- Implemented dict-like interface for `KedroDataCatalog`.
- Implemented lazy dataset initializing for `KedroDataCatalog`.
- Project dependencies on both the default template and on starter templates are now explicitly declared on the `pyproject.toml` file, allowing Kedro projects to work with project management tools like `uv`, `pdm`, and `rye`.
Note: KedroDataCatalog is an experimental feature and is under active development. Therefore, it is possible we'll introduce breaking changes to this class, so be mindful of that if you decide to use it already. Let us know if you have any feedback about the KedroDataCatalog or ideas for new features.
Bug fixes and other changes
- Added I/O support for Oracle Cloud Infrastructure (OCI) Object Storage filesystem.
- Fixed `DatasetAlreadyExistsError` for `ThreadRunner` when running a Kedro project and using the runner separately.
Breaking changes to the API
Documentation changes
- Added Databricks Asset Bundles deployment guide.
- Added a new minimal Kedro project creation guide.
- Added example to explain how dataset factories work.
- Updated CLI autocompletion docs with new Click syntax.
- Standardised `.parquet` suffix in docs and tests.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- G. D. McBain
- Greg Vaslowski
- Hyewon Choi
- Pedro Antonacio
- Python
Published by merelcht over 1 year ago
kedro - 0.19.9
Major features and improvements
- Dropped Python 3.8 support.
- Implemented `KedroDataCatalog` repeating `DataCatalog` functionality with a few API enhancements:
  - Removed `_FrozenDatasets` and access datasets as properties;
  - Added get dataset by name feature;
  - `add_feed_dict()` was simplified to only add raw data;
  - Datasets' initialisation was moved out from `from_config()` method to the constructor.
- Moved development requirements from `requirements.txt` to the dedicated section in `pyproject.toml` for project template.
- Implemented `Protocol` abstraction for the current `DataCatalog` and adding new catalog implementations.
- Refactored `kedro run` and `kedro catalog` commands.
- Moved pattern resolution logic from `DataCatalog` to a separate component - `CatalogConfigResolver`. Updated `DataCatalog` to use `CatalogConfigResolver` internally.
- Made packaged Kedro projects return `session.run()` output to be used when running it in the interactive environment.
- Enhanced `OmegaConfigLoader` configuration validation to detect duplicate keys at all parameter levels, ensuring comprehensive nested key checking.
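Detecting duplicate keys at every nesting level can be sketched with a short recursive helper. This is a standalone illustration, not `OmegaConfigLoader`'s code: `duplicate_key_paths` is a hypothetical name, and the sketch assumes that two dictionaries under the same key are compared recursively rather than flagged outright.

```python
def duplicate_key_paths(first, second, prefix=""):
    """Return dotted paths of keys defined in both nested dicts.

    Two dict-valued keys are compared recursively; any other shared key
    is reported as a duplicate.
    """
    duplicates = []
    for key in sorted(first.keys() & second.keys()):
        path = f"{prefix}{key}"
        if isinstance(first[key], dict) and isinstance(second[key], dict):
            duplicates.extend(duplicate_key_paths(first[key], second[key], path + "."))
        else:
            duplicates.append(path)
    return duplicates


# Two hypothetical parameter files loaded from the same environment.
file_a = {"model": {"lr": 0.1, "epochs": 5}, "seed": 1}
file_b = {"model": {"lr": 0.2, "depth": 3}, "split": 0.8}
found = duplicate_key_paths(file_a, file_b)
```

A top-level-only check would see the shared `model` key without being able to say which nested parameter actually collides; the recursive version pinpoints `model.lr`.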
Note: KedroDataCatalog is an experimental feature and is under active development. Therefore, it is possible we'll introduce breaking changes to this class, so be mindful of that if you decide to use it already. Let us know if you have any feedback about the KedroDataCatalog or ideas for new features.
Bug fixes and other changes
- Fixed bug where using dataset factories breaks with `ThreadRunner`.
- Fixed a bug where `SharedMemoryDataset.exists` would not call the underlying `MemoryDataset`.
- Fixed template projects example tests.
- Made credentials loading consistent between `KedroContext._get_catalog()` and `resolve_patterns` so that both use `_get_config_credentials()`.
Breaking changes to the API
- Removed `ShelveStore` to address a security vulnerability.
Documentation changes
- Fix logo on PyPI page.
- Minor language/styling updates.
Community contributions
- Python
Published by merelcht over 1 year ago
kedro - 0.19.8
Major features and improvements
- Made default run entrypoint in `__main__.py` work in interactive environments such as IPython and Databricks.
Bug fixes and other changes
- Fixed a bug that caused tracebacks to disappear from CLI runs.
- Moved `_find_run_command()` and `_find_run_command_in_plugins()` from `__main__.py` in the project template to the framework itself.
- Fixed a bug where `%load_node` breaks with multi-line import statements.
- Fixed a regression where `rich` markup logs stopped showing since 0.19.7.
Breaking changes to the API
Documentation changes
- Add clarifications in docs explaining how runtime parameter resolution works.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- cclauss
- eltociear
- ltalirz
- Python
Published by merelcht almost 2 years ago
kedro - 0.19.7
Major features and improvements
- Exposed `load` and `save` publicly for each dataset in the core `kedro` library, and enabled other datasets to do the same. If a dataset doesn't expose `load` or `save` publicly, Kedro will fall back to using `_load` or `_save`, respectively.
- Kedro commands are now lazily loaded to add performance gains when running Kedro commands.
- Implemented key completion support for accessing datasets in the `DataCatalog`.
- Implemented dataset pretty printing.
- Implemented `DataCatalog` pretty printing.
- Moved to an opt-out model for telemetry, enabling it by default without requiring prior consent.
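The fallback from public `load`/`save` to `_load`/`_save` can be pictured with the standalone sketch below. It only models the lookup order described above; `resolve_io_method` and the two dataset classes are hypothetical and do not reflect how Kedro wires this internally.

```python
def resolve_io_method(dataset, name):
    """Return the public `name` method if defined, else the `_name` one."""
    public = getattr(dataset, name, None)
    if callable(public):
        return public
    private = getattr(dataset, f"_{name}", None)
    if callable(private):
        return private
    raise AttributeError(
        f"{type(dataset).__name__} defines neither {name} nor _{name}"
    )


class LegacyDataset:
    """Only implements the older private method."""

    def _load(self):
        return "loaded via _load"


class ModernDataset:
    """Exposes the public method directly."""

    def load(self):
        return "loaded via load"


legacy_result = resolve_io_method(LegacyDataset(), "load")()
modern_result = resolve_io_method(ModernDataset(), "load")()
```

Callers ask for `load` in both cases, so datasets written against the older private methods keep working unchanged.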
Bug fixes and other changes
- Updated error message for invalid catalog entries.
- Updated error message for catalog entries when the dataset class is not found with hints on how to resolve the issue.
- Fixed a bug in the `DataCatalog` `shallow_copy()` method to ensure it returns the type of the used catalog and doesn't cast it to `DataCatalog`.
- Made kedro-telemetry a core dependency.
- Fixed a bug where a few arguments were missing when `OmegaConfigLoader` is printed.
- Fixed a bug where iterating `OmegaConfigLoader`'s `keys` returned an empty dictionary.
Breaking changes to the API
Upcoming deprecations for Kedro 0.20.0
- The utility method `get_pkg_version()` is deprecated and will be removed in Kedro 0.20.0.
Documentation changes
- Improved documentation for configuring dataset parameters in the data catalog
- Extended documentation with an example of logging customisation at runtime
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- nickolasrm
- yury-fedotov
- Python
Published by merelcht almost 2 years ago
kedro - 0.19.6
Major features and improvements
- Added `raise_errors` argument to `find_pipelines`. If `True`, the first pipeline for which autodiscovery fails will cause an error to be raised. The default behaviour is still to raise a warning for each failing pipeline.
- It is now possible to use Kedro without having `rich` installed.
- Updated custom logging behavior: `conf/logging.yml` will be used if it exists and `KEDRO_LOGGING_CONFIG` is not set; otherwise, `default_logging.yml` will be used.
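The logging configuration lookup described above can be expressed as a small precedence function. This is a sketch under stated assumptions: `resolve_logging_config` is a hypothetical helper, and it assumes that a set `KEDRO_LOGGING_CONFIG` takes priority, since the release note only describes the case where the variable is not set.

```python
import os
import tempfile
from pathlib import Path


def resolve_logging_config(project_path, env=None, default="default_logging.yml"):
    """Pick the logging configuration file to use.

    Assumption: when KEDRO_LOGGING_CONFIG is set it takes priority; the
    release note only describes the case where it is not set.
    """
    env = os.environ if env is None else env
    explicit = env.get("KEDRO_LOGGING_CONFIG")
    if explicit:
        return Path(explicit)
    project_config = Path(project_path) / "conf" / "logging.yml"
    if project_config.exists():
        return project_config
    return Path(default)


# Exercise the three branches against a throwaway project directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "conf").mkdir()
    (Path(tmp) / "conf" / "logging.yml").write_text("version: 1\n")
    from_project = resolve_logging_config(tmp, env={})
    from_env = resolve_logging_config(tmp, env={"KEDRO_LOGGING_CONFIG": "custom.yml"})
    fallback = resolve_logging_config(Path(tmp) / "missing", env={})
```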
Bug fixes and other changes
- User defined catch-all dataset factory patterns now override the default pattern provided by the runner.
Breaking changes to the API
Upcoming deprecations for Kedro 0.20.0
- All micro-packaging commands (`kedro micropkg pull`, `kedro micropkg package`) are deprecated and will be removed in Kedro 0.20.0.
Documentation changes
- Improved documentation for custom starters
- Added a new docs section on deploying Kedro project on AWS Airflow MWAA
- Detailed instructions on using `globals` and `runtime_params` with the `OmegaConfigLoader`
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- doxenix
- cleeeks
- Python
Published by merelcht about 2 years ago
kedro - 0.19.5
Bug fixes and other changes
- Fixed breaking import issue when working on a project with `kedro-viz` on Python 3.8.
Documentation changes
- Updated the documentation for deploying a Kedro project with Astronomer Airflow.
- Used `kedro-sphinx-theme` for documentation.
- Python
Published by merelcht about 2 years ago
kedro - 0.19.4
Major features and improvements
- Kedro commands now work from any subdirectory within a Kedro project.
- Kedro CLI now provides a better error message when project commands, such as `kedro run`, are run outside of a project.
- Added the `--telemetry` flag to `kedro new`, allowing the user to register consent to have user analytics collected at the same time as the project is created.
- Improved the performance of `Pipeline` object creation and summing.
- Improved suggestions to resume failed pipeline runs.
- Dropped the dependency on `toposort` in favour of the built-in `graphlib` module.
- Cookiecutter errors are shown in short format without the `--verbose` flag.
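For readers unfamiliar with the standard-library `graphlib` module mentioned above, the sketch below orders a small dependency graph with `graphlib.TopologicalSorter`. The node names and dependencies are hypothetical, and this is not how Kedro builds its own graph; it only shows the kind of ordering the module provides.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on (hypothetical pipeline).
dependencies = {
    "train_model": {"split_data"},
    "split_data": {"preprocess"},
    "preprocess": set(),
    "evaluate": {"train_model", "split_data"},
}

# static_order() yields every node after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is a convenient way to reject circular pipelines.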
Bug fixes and other changes
- Updated `kedro pipeline create` and `kedro pipeline delete` to read the base environment from the project settings.
- Updated CLI command `kedro catalog resolve` to read credentials properly.
- Changed the path of where pipeline tests generated with `kedro pipeline create` from `<project root>/src/tests/pipelines/<pipeline name>` to `<project root>/tests/pipelines/<pipeline name>`.
- Updated `.gitignore` to prevent pushing Mlflow local runs folder to a remote forge when using mlflow and git.
- Fixed error handling message for malformed yaml/json files in OmegaConfigLoader.
- Fixed a bug in `node`-creation allowing self-dependencies when using transcoding, that is datasets named like `name@format`.
- Improved error message when passing wrong value to node.
Breaking changes to the API
- Methods `_is_project` and `_find_kedro_project` have been moved to `kedro.utils`. We recommend not using private methods in your code, but if you do, please update your code to use the new location.
Documentation changes
- Added missing description for `merge_strategy` argument in OmegaConfigLoader.
- Added documentation on best practices for testing nodes and pipelines.
- Clarified docs around using custom resolvers without a full Kedro project.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- Python
Published by merelcht about 2 years ago
kedro - 0.19.3
Major features and improvements
- Create the debugging line magic `%load_node` for Jupyter Notebook and Jupyter Lab.
- Add better IPython, VSCode Notebook support for `%load_node` and minimal support for Databricks.
- Add full Kedro Node input syntax for `%load_node`.
Bug fixes and other changes
- Updated CLI Command `kedro catalog resolve` to work with dataset factories that use `PartitionedDataset`.
- Addressed arbitrary file write via archive extraction security vulnerability in micropackaging.
- Added the `_EPHEMERAL` attribute to `AbstractDataset` and other Dataset classes that inherit from it.
- Added new JSON Schema that works with Kedro versions 0.19.*
Breaking changes to the API
Documentation changes
- Enable read-the-docs search when user presses Command/Ctrl + K.
- Added documentation for `kedro-telemetry` and the data collected by it.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- MosaicMan
- Fazil
- Python
Published by merelcht over 2 years ago
kedro - 0.19.2
Bug fixes and other changes
- Removed example pipeline requirements when examples are not selected in `tools`.
- Allowed modern versions of JupyterLab and Jupyter Notebooks.
- Removed setuptools dependency.
- Added `source_dir` explicitly in `pyproject.toml` for non-src layout project.
- `MemoryDataset` entries are now included in free outputs.
- Removed black dependency and replaced its functionality with `ruff format`.
- Added logging about not using async mode in `SequentialRunner` and `ParallelRunner`.
Breaking changes to the API
- Changed input format for tools option obtained from --config file from numbers to short names.
Documentation changes
- Added documentation about `bootstrap_project` and `configure_project`.
- Added documentation about `kedro run` and hook execution order.
- Python
Published by merelcht over 2 years ago
kedro - 0.19.0
:rocket: Major Features and improvements
- Dropped Python 3.7 support.
- Introduced project tools and example to the kedro new CLI flow.
- The new spaceflights starters, spaceflights-pandas, spaceflights-pandas-viz, spaceflights-pyspark, and spaceflights-pyspark-viz can be used with the kedro new command with the `--starter` flag.
- Added the `--conf-source` option to `%reload_kedro`, allowing users to specify a source for project configuration.
- Added the functionality to choose a merging strategy for config files loaded with OmegaConfigLoader.
- Modified the mechanism of importing datasets, raise more explicit error when dependencies are missing.
- Added validation for configuration file used to override run commands via the CLI.
- Moved the default environment base and local from config loader to `_ProjectSettings`. This enables the use of config loader as a standalone class without affecting existing Kedro Framework users.
:beetle: Bug fixes and other changes
- Added a new field tools to pyproject.toml when a project is created.
- Reduced spaceflights data to minimise waiting times during tutorial execution.
- Added validation to node tags to be consistent with node names.
- Removed pip-tools as a dependency.
- Accepted path-like filepaths more broadly for datasets.
:boom: Breaking changes
- Removed ConfigLoader and TemplatedConfigLoader.
- Removed kedro.extras.datasets and tests (use kedro-datasets instead)
- Removed PartitionedDataset and IncrementalDataset from `kedro.io` (import them from kedro-datasets instead)
- logging is removed from OmegaConfigLoader in favour of the environment variable `KEDRO_LOGGING_CONFIG`.
- Removed support for defining the layer attribute at top-level within DataCatalog.
- Renamed `data_set` and DataSet to dataset and Dataset everywhere.
- Removed the `create_default_data_set()` method in the Runner in favour of using dataset factories to create default dataset instances.
- The default project template now has only one pyproject.toml at the root of the project (containing both the packaging metadata and the Kedro build config).
:writing_hand: Documentation changes
- Added new top navigation to easily switch between Framework, Viz, and Datasets.
- Added new search-as-you-type to improve the search experience.
New Contributors
- @MinuraPunchihewa made their first contribution in https://github.com/kedro-org/kedro/pull/3115
- @mustious made their first contribution in https://github.com/kedro-org/kedro/pull/3181
- @JayOaks made their first contribution in https://github.com/kedro-org/kedro/pull/3239
- @adamkells made their first contribution in https://github.com/kedro-org/kedro/pull/3203
- @HKABIG made their first contribution in https://github.com/kedro-org/kedro/pull/3270
- @pdave34 made their first contribution in https://github.com/kedro-org/kedro/pull/3213
- @hermlon made their first contribution in https://github.com/kedro-org/kedro/pull/3303
Full Changelog: https://github.com/kedro-org/kedro/compare/0.18.14...0.19.0
:rotating_light: If you are upgrading from Kedro 0.18, have a look at the migration guide.
We welcome every community contribution, large or small. See what we're working on now and report bugs or suggest future features. Until next time, The Kedro Team :yellow_heart:
- Python
Published by idanov over 2 years ago
kedro - 0.18.14
Release 0.18.14
Major features and improvements
- Allowed using of custom cookiecutter templates for creating pipelines with `--template` flag for `kedro pipeline create` or via `template/pipeline` folder.
- Allowed overriding of configuration keys with runtime parameters using the `runtime_params` resolver with `OmegaConfigLoader`.
Bug fixes and other changes
- Updated dataset factories to resolve nested catalog config properly.
- Updated `OmegaConfigLoader` to handle paths containing dots outside of `conf_source`.
- Made `settings.py` optional.
Documentation changes
- Added documentation to clarify execution order of hooks.
- Added a notebook example for spaceflights to illustrate how to incrementally add Kedro features.
- Moved documentation for the `standalone-datacatalog` starter into its README file.
- Added new documentation about deploying a Kedro project with Amazon EMR.
- Added new documentation about how to publish a Kedro-Viz project to make it shareable.
- New TSC members added to the page and the organisation of each member is also now listed.
- Plus some minor bug fixes and changes across the documentation.
Upcoming deprecations for Kedro 0.19.0
- All dataset classes will be removed from the core Kedro repository (`kedro.extras.datasets`). Install and import them from the `kedro-datasets` package instead.
- All dataset classes ending with `DataSet` are deprecated and will be removed in Kedro `0.19.0` and `kedro-datasets` `2.0.0`. Instead, use the updated class names ending with `Dataset`.
- The starters `pandas-iris`, `pyspark-iris`, `pyspark`, and `standalone-datacatalog` are deprecated and will be archived in Kedro 0.19.0.
- `PartitionedDataset` and `IncrementalDataset` have been moved to `kedro-datasets` and will be removed in Kedro `0.19.0`. Install and import them from the `kedro-datasets` package instead.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
- Jason Hite
- IngerMathilde
- Laíza Milena Scheid Parizotto
- Richard
- flpvvvv
- qheuristics
- Miguel Ortiz
- rxm7706
- Iñigo Hidalgo
- harmonys-qb
- Yi Kuang
- Jens Lordén
- Python
Published by idanov over 2 years ago
kedro - 0.18.12
Release 0.18.12
Major features and improvements
- Added dataset factories feature which uses pattern matching to reduce the number of catalog entries.
- Activated all built-in resolvers by default for `OmegaConfigLoader` except for `oc.env`.
- Added `kedro catalog rank` CLI command that ranks dataset factories in the catalog by matching priority.
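The idea behind dataset factories, one catalog pattern standing in for many entries, can be sketched with the standard library alone. Everything below is illustrative: the helper names are hypothetical, the regex-based matching is a stand-in for whatever Kedro uses internally, and ranking patterns by their number of literal characters is an assumption about what "matching priority" means.

```python
import re


def pattern_to_regex(pattern):
    """Turn '{name}_csv' into a regex with one named group per placeholder."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        # re.split with one capture group alternates literal text and names.
        regex += re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+)"
    return re.compile(f"^{regex}$")


def specificity(pattern):
    """Count literal characters; used here as the ranking key (an assumption)."""
    return len(re.sub(r"\{\w+\}", "", pattern))


def resolve(dataset_name, factories):
    """Return the config for the most specific matching pattern, or None."""
    for pattern in sorted(factories, key=specificity, reverse=True):
        match = pattern_to_regex(pattern).match(dataset_name)
        if match:
            config = factories[pattern]
            return {key: value.format(**match.groupdict()) for key, value in config.items()}
    return None


factories = {
    "{name}_csv": {"type": "pandas.CSVDataset", "filepath": "data/{name}.csv"},
    "{default}": {"type": "MemoryDataset"},
}
resolved = resolve("cars_csv", factories)
fallback = resolve("anything", factories)
```

Trying the most specific pattern first keeps a catch-all such as `{default}` from shadowing narrower patterns in this sketch.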
Bug fixes and other changes
- Consolidated dependencies and optional dependencies in `pyproject.toml`.
- Made validation of unique node outputs much faster.
- Updated `kedro catalog list` to show datasets generated with factories.
Documentation changes
- Recommended `ruff` as the linter and removed mentions of `pylint`, `isort`, `flake8`.
Community contributions
Thanks to Laíza Milena Scheid Parizotto and Chris Schopp.
Breaking changes to the API
Upcoming deprecations for Kedro 0.19.0
- `ConfigLoader` and `TemplatedConfigLoader` will be deprecated. Please use `OmegaConfigLoader` instead.
- Python
Published by idanov almost 3 years ago
kedro - 0.18.11
Release 0.18.11
Major features and improvements
- Added databricks-iris as an official starter.
Bug fixes and other changes
- Reworked micropackaging workflow to use standard Python packaging practices.
- Make kedro micropkg package accept --verbose.
Documentation changes
- Significant improvements to the documentation that covers working with Databricks and Kedro, including a new page for workspace-only development, and a guide to choosing the best workflow for your use case.
- Updated documentation for deploying with Prefect for version 2.0.
- Python
Published by idanov almost 3 years ago
kedro - 0.18.9
Major features and improvements
- `kedro run --params` now updates interpolated parameters correctly when using `OmegaConfigLoader`.
- Added `metadata` attribute to `kedro.io` datasets. This is ignored by Kedro, but may be consumed by users or external plugins.
- Added `kedro.logging.RichHandler`. This replaces the default `rich.logging.RichHandler` and is more flexible, user can turn off the `rich` traceback if needed.
Bug fixes and other changes
- `OmegaConfigLoader` will return a `dict` instead of `DictConfig`.
- `OmegaConfigLoader` does not show a `MissingConfigError` when the config files exist but are empty.
Documentation changes
- Added documentation for collaborative experiment tracking within Kedro-Viz.
- Revised section on deployment to better organise content and reflect how recently docs have been updated.
- Minor improvements to fix typos and revise docs to align with engineering changes.
Breaking changes to the API
- `kedro package` does not produce `.egg` files anymore, and now relies exclusively on `.whl` files.
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
Published by idanov about 3 years ago
kedro - 0.18.8
Major features and improvements
- Added `KEDRO_LOGGING_CONFIG` environment variable, which can be used to configure logging from the beginning of the `kedro` process.
- Removed logs folder from the `kedro new` project template. File-based logging will remain but just be level INFO and above and go to project root instead.
Bug fixes and other changes
- Improvements to Jupyter E2E tests.
- Added full `kedro run` CLI command to session store to improve run reproducibility using `Kedro-Viz` experiment tracking.
Documentation changes
- Improvements to documentation about configuration.
- Improvements to Sphinx toolchain including incrementing to use a newer version.
- Improvements to documentation on visualising Kedro projects on Databricks, and additional documentation about the development workflow for Kedro projects on Databricks.
- Updated Technical Steering Committee membership documentation.
- Revised documentation section about linting and formatting and extended to give details of `flake8` configuration.
- Updated table of contents for documentation to reduce scrolling.
- Expanded FAQ documentation.
- Added a 404 page to documentation.
- Added deprecation warnings about the removal of `kedro.extras.datasets`.
Published by idanov about 3 years ago
kedro - 0.18.7
Release 0.18.7
Major features and improvements
- Added new Kedro CLI command `kedro jupyter setup` to set up a Jupyter kernel for Kedro.
- `kedro package` now includes the project configuration in a compressed `tar.gz` file.
- Added functionality to the `OmegaConfigLoader` to load configuration from compressed files of `zip` or `tar` format. This feature requires `fsspec>=2023.1.0`.
- Significant improvements to on-boarding documentation that covers setup for new Kedro users. Also some major changes to the spaceflights tutorial to make it faster to work through. We think it's a better read. Tell us if it's not.
Bug fixes and other changes
- Added a guide and tooling for developing Kedro for Databricks.
- Implement missing dict-like interface for `_ProjectPipeline`.
Published by idanov about 3 years ago
kedro - 0.18.6
Release 0.18.6
Bug fixes and other changes
- Fixed bug that prevented reading or writing datasets with `s3a` or `s3n` filepaths.
- Fixed bug with overriding nested parameters using the `--params` flag.
- Fixed bug that made session store incompatible with `Kedro-Viz` experiment tracking.
Migration guide from Kedro 0.18.5 to 0.18.6
A regression introduced in Kedro version 0.18.5 caused the Kedro-Viz console to fail to show experiment tracking correctly. If you experienced this issue, you will need to:
* upgrade to Kedro version 0.18.6
* delete any erroneous session entries created with Kedro 0.18.5 from your sessionstore.db stored at `
Thanks to Kedroids tomohiko kato, tsanikgr and maddataanalyst for very detailed reports about the bug.
Published by idanov about 3 years ago
kedro - 0.18.5
Release 0.18.5
NOTE: This version of Kedro introduced a bug that causes the Kedro-Viz console to fail to show experiment tracking correctly. We recommend that you don't use it and use Kedro version `0.18.6` instead.
Major features and improvements
- Added new `OmegaConfigLoader` which uses `OmegaConf` for loading and merging configuration.
- Added the `--conf-source` option to `kedro run`, allowing users to specify a source for project configuration for the run.
- Added `omegaconf` syntax as option for `--params`. Keys and values can now be separated by colons or equals signs.
- Added support for generator functions as nodes, i.e. using `yield` instead of `return`.
  - Enable chunk-wise processing in nodes with generator functions.
  - Save node outputs after every `yield` before proceeding with next chunk.
- Fixed incorrect parsing of Azure Data Lake Storage Gen2 URIs used in datasets.
- Added support for loading credentials from environment variables using `OmegaConfigLoader`.
- Added new `--namespace` flag to `kedro run` to enable filtering by node namespace.
- Added a new argument `node` for all four dataset hooks.
- Added the `kedro run` flags `--nodes`, `--tags`, and `--load-versions` to replace `--node`, `--tag`, and `--load-version`.
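The generator-node behaviour described in the list above (process a chunk, save it, then move on to the next chunk) can be sketched in plain Python. This is only an illustration of the control flow, not Kedro's runner: `process_in_chunks`, `run_generator_node` and the in-memory `store` are hypothetical names, and in a real project the target dataset would presumably need to append rather than overwrite on each save.

```python
from typing import Callable, Iterable, Iterator, List


def process_in_chunks(rows: Iterable[int], chunk_size: int = 2) -> Iterator[List[int]]:
    """A node function written as a generator: yields one processed chunk at a time."""
    chunk: List[int] = []
    for row in rows:
        chunk.append(row * 10)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


def run_generator_node(
    func: Callable[[Iterable[int]], Iterator[List[int]]],
    inputs: Iterable[int],
    save: Callable[[List[int]], None],
) -> int:
    """Toy runner: save each yielded chunk before asking for the next one."""
    saved = 0
    for chunk in func(inputs):
        save(chunk)  # the output is persisted after every `yield`
        saved += 1
    return saved


store: List[List[int]] = []  # stands in for an appending dataset
n_chunks = run_generator_node(process_in_chunks, range(5), store.append)
print(n_chunks, store)
```

Because only one chunk is held in memory at a time, this style suits inputs that are too large to process in a single pass.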
Bug fixes and other changes
- Commas surrounded by square brackets (only possible for nodes with default names) will no longer split the arguments to `kedro run` options which take a list of nodes as inputs (`--from-nodes` and `--to-nodes`).
- Fixed bug where `micropkg` manifest section in `pyproject.toml` isn't recognised as allowed configuration.
- Fixed bug causing `load_ipython_extension` not to register the `%reload_kedro` line magic when called in a directory that does not contain a Kedro project.
- Added `anyconfig`'s `ac_context` parameter to `kedro.config.common` module functions for more flexible `ConfigLoader` customizations.
- Replaced references to the `kedro.pipeline.Pipeline` object throughout the test suite with the `kedro.modular_pipeline.pipeline` factory.
- Fixed bug causing the `after_dataset_saved` hook only to be called for one output dataset when multiple are saved in a single node and async saving is in use.
- Log level for "Credentials not found in your Kedro project config" was changed from `WARNING` to `DEBUG`.
- Added safe extraction of tar files in `micropkg pull` to fix vulnerability caused by CVE-2007-4559.
- Documentation improvements
  - Bug fix in table font size
  - Updated API docs links for datasets
  - Improved CLI docs for `kedro run`
  - Revised documentation for visualisation to build plots and for experiment tracking
  - Added example for loading external credentials to the Hooks documentation
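For readers unfamiliar with CVE-2007-4559 (path traversal through crafted member names in tar archives), the following is a minimal sketch of what "safe extraction" can look like using only the standard library. It is not the code used by `micropkg pull`; `safe_extract` is a hypothetical helper, and it only checks member paths (it does not inspect symlink targets).

```python
import os
import tarfile


def safe_extract(archive: tarfile.TarFile, destination: str) -> None:
    """Extract the archive only if every member resolves inside `destination`."""
    root = os.path.realpath(destination)
    for member in archive.getmembers():
        target = os.path.realpath(os.path.join(root, member.name))
        # A member such as "../evil.txt" would resolve outside the root.
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f"Blocked path traversal attempt: {member.name}")
    if hasattr(tarfile, "data_filter"):
        # Newer Python versions ship extraction filters; use one when available.
        archive.extractall(root, filter="data")
    else:
        archive.extractall(root)
```

A caller would open the archive with `tarfile.open(...)` and pass it to `safe_extract` instead of calling `extractall` directly.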
Breaking changes to the API
Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
Upcoming deprecations for Kedro 0.19.0
- `project_version` will be deprecated in `pyproject.toml`; please use `kedro_init_version` instead.
- Deprecated `kedro run` flags `--node`, `--tag`, and `--load-version` in favour of `--nodes`, `--tags`, and `--load-versions`.
Published by idanov over 3 years ago
kedro - 0.18.4
Major features and improvements
- Make Kedro instantiate datasets from `kedro_datasets` with higher priority than `kedro.extras.datasets`. `kedro_datasets` is the namespace for the new `kedro-datasets` python package.
- The config loader objects now implement `UserDict` and the configuration is accessed through `conf_loader['catalog']`.
- You can configure config file patterns through `settings.py` without creating a custom config loader.
- Added the following new datasets:
| Type | Description | Location |
| ------------------------------------ | -------------------------------------------------------------------------- | ----------------------------- |
| svmlight.SVMLightDataSet | Work with svmlight/libsvm files using scikit-learn library | kedro.extras.datasets.svmlight |
| video.VideoDataSet | Read and write video files from a filesystem | kedro.extras.datasets.video |
| video.video_dataset.SequenceVideo | Create a video object from an iterable sequence to use with VideoDataSet | kedro.extras.datasets.video |
| video.video_dataset.GeneratorVideo | Create a video object from a generator to use with VideoDataSet | kedro.extras.datasets.video |
* Implemented support for a functional definition of schema in dask.ParquetDataSet to work with the dask.to_parquet API.
Bug fixes and other changes
- Fixed `kedro micropkg pull` for packages on PyPI.
- Fixed `format` in `save_args` for `SparkHiveDataSet`; previously it didn't allow you to save it as delta format.
- Fixed save errors in `TensorFlowModelDataset` when used without versioning; previously, it wouldn't overwrite an existing model.
- Added support for `tf.device` in `TensorFlowModelDataset`.
- Updated error message for `VersionNotFoundError` to handle insufficient permission issues for cloud storage.
- Updated Experiment Tracking docs with working examples.
- Updated MatplotlibWriter Dataset, TextDataset, plotly.PlotlyDataSet and plotly.JSONDataSet docs with working examples.
- Modified implementation of the Kedro IPython extension to use `local_ns` rather than a global variable.
- Refactored `ShelveStore` to its own module to ensure multiprocessing works with it.
- `kedro.extras.datasets.pandas.SQLQueryDataSet` now takes optional argument `execution_options`.
- Removed `attrs` upper bound to support newer versions of Airflow.
- Bumped the lower bound for the `setuptools` dependency to <=61.5.1.
Minor breaking changes to the API
Upcoming deprecations for Kedro 0.19.0
- `kedro test` and `kedro lint` will be deprecated.
Documentation
- Revised the Introduction to shorten it
- Revised the Get Started section to remove unnecessary information and clarify the learning path
- Updated the spaceflights tutorial to simplify the later stages and clarify what the reader needed to do in each phase
- Moved some pages that covered advanced materials into more appropriate sections
- Moved visualisation into its own section
- Fixed a bug that degraded user experience: the table of contents is now sticky when you navigate between pages
- Added redirects where needed on ReadTheDocs for legacy links and bookmarks
Contributions from the Kedroid community
We are grateful to the following for submitting PRs that contributed to this release: jstammers, FlorianGD, yash6318, carlaprv, dinotuku, williamcaicedo, avan-sh, Kastakin, amaralbf, BSGalvan, levimjoseph, daniel-falk, clotildeguinard, avsolatorio, and picklejuicedev for comments and input to documentation changes
Published by idanov over 3 years ago
kedro - 0.18.3
Release 0.18.3
Major features and improvements
- Implemented autodiscovery of project pipelines. A pipeline created with `kedro pipeline create <pipeline_name>` can now be accessed immediately without needing to explicitly register it in `src/<package_name>/pipeline_registry.py`, either individually by name (e.g. `kedro run --pipeline=<pipeline_name>`) or as part of the combined default pipeline (e.g. `kedro run`). By default, the simplified `register_pipelines()` function in `pipeline_registry.py` looks like:

```python
from typing import Dict

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from pipeline names to ``Pipeline`` objects.
    """
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines
```

- The Kedro IPython extension should now be loaded with `%load_ext kedro.ipython`.
- The line magic `%reload_kedro` now accepts keyword arguments, e.g. `%reload_kedro --env=prod`.
- Improved the resume pipeline suggestion for `SequentialRunner`: it now backtracks to the closest persisted inputs to resume from.
Bug fixes and other changes
- Changed the default value of `show_locals` for rich logging to `False`, so that credentials and other sensitive data aren't shown in logs.
- Rich traceback handling is disabled on Databricks so that exceptions now halt execution as expected. This is a workaround for a bug in `rich`.
- When using `kedro run -n [some_node]`, if `some_node` is missing a namespace the resulting error message will suggest the correct node name.
- Updated documentation for `rich` logging.
- Updated Prefect deployment documentation to allow for reruns with saved versioned datasets.
- The Kedro IPython extension now surfaces errors when it cannot load a Kedro project.
- Relaxed `delta-spark` upper bound to allow compatibility with Spark 3.1.x and 3.2.x.
- Added `gdrive` to list of cloud protocols, enabling Google Drive paths for datasets.
- Added svg logo resource for ipython kernel.
Upcoming deprecations for Kedro 0.19.0
- The Kedro IPython extension will no longer be available as `%load_ext kedro.extras.extensions.ipython`; use `%load_ext kedro.ipython` instead.
- `kedro jupyter convert`, `kedro build-docs`, `kedro build-reqs` and `kedro activate-nbstripout` will be deprecated.
Published by idanov over 3 years ago
kedro - 0.18.2
Release 0.18.2
Major features and improvements
- Added `abfss` to list of cloud protocols, enabling abfss paths.
- Kedro now uses the Rich library to format terminal logs and tracebacks.
- The file `conf/base/logging.yml` is now optional. See our documentation for details.
- Introduced a `kedro.starters` entry point. This enables plugins to create custom starter aliases used by `kedro starter list` and `kedro new`.
- Reduced the `kedro new` prompts to just one question asking for the project name.
Bug fixes and other changes
- Bumped `pyyaml` upper bound to make Kedro compatible with the pyodide stack.
- Updated project template's Sphinx configuration to use `myst_parser` instead of `recommonmark`.
- Reduced number of log lines by changing the logging level from `INFO` to `DEBUG` for low priority messages.
- Kedro's framework-side logging configuration no longer performs file-based logging. Hence superfluous `info.log`/`errors.log` files are no longer created in your project root, and running Kedro on read-only file systems such as Databricks Repos is now possible.
- The `root` logger is now set to the Python default level of `WARNING` rather than `INFO`. Kedro's logger is still set to emit `INFO` level messages.
- `SequentialRunner` now has consistent execution order across multiple runs with sorted nodes.
- Bumped the upper bound for the Flake8 dependency to <5.0.
- `kedro jupyter notebook/lab` no longer reuses a Jupyter kernel.
- Required `cookiecutter>=2.1.1` to address a known command injection vulnerability.
- The session store no longer fails if a username cannot be found with `getpass.getuser`.
- Added generic typing for `AbstractDataSet` and `AbstractVersionedDataSet` as well as typing to all datasets.
- Rendered the deployment guide flowchart as a Mermaid diagram, and added Dask.
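The effect of the logger level change noted in the list above can be reproduced with the standard `logging` module alone. The sketch below assumes that Kedro's logger is named `kedro` and uses a made-up third-party logger name (`some_library`); it shows level inheritance only, not Kedro's actual logging configuration.

```python
import logging

# Root logger at the Python default level: only WARNING and above pass.
logging.getLogger().setLevel(logging.WARNING)
# The "kedro" logger (assumed name) is set explicitly, so INFO still passes.
logging.getLogger("kedro").setLevel(logging.INFO)

kedro_child = logging.getLogger("kedro.runner")  # inherits INFO from "kedro"
third_party = logging.getLogger("some_library")  # hypothetical; inherits WARNING from root

print(kedro_child.isEnabledFor(logging.INFO))  # True
print(third_party.isEnabledFor(logging.INFO))  # False
```

In practice this means INFO messages from other libraries stop appearing by default, while Kedro's own INFO messages are still emitted.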
Minor breaking changes to the API
- The module `kedro.config.default_logger` no longer exists; default logging configuration is now set automatically through `kedro.framework.project.LOGGING`. Unless you explicitly import `kedro.config.default_logger` you do not need to make any changes.
Upcoming deprecations for Kedro 0.19.0
- `kedro.extras.ColorHandler` will be removed in 0.19.0.
Published by idanov almost 4 years ago
kedro - 0.18.1
Major features and improvements
- Added a new hook `after_context_created` that passes the `KedroContext` instance as `context`.
- Added a new CLI hook `after_command_run`.
- Added more detail to YAML `ParserError` exception error message.
- Added option to `SparkDataSet` to specify a `schema` load argument that allows for supplying a user-defined schema as opposed to relying on the schema inference of Spark.
- The Kedro package no longer contains a built version of the Kedro documentation, significantly reducing the package size.
Bug fixes and other changes
- Removed fatal error from being logged when a Kedro session is created in a directory without git.
- Fixed `CONFIG_LOADER_CLASS` validation so that `TemplatedConfigLoader` can be specified in `settings.py`. Any `CONFIG_LOADER_CLASS` must be a subclass of `AbstractConfigLoader`.
- Added runner name to the `run_params` dictionary used in pipeline hooks.
- Updated Databricks documentation to include how to get it working with IPython extension and Kedro-Viz.
- Updated sections on visualisation, namespacing, and experiment tracking in the spaceflights tutorial to correspond to the complete spaceflights starter.
- Fixed `Jinja2` syntax loading with `TemplatedConfigLoader` using `globals.yml`.
- Removed global `_active_session`, `_activate_session` and `_deactivate_session`. Plugins that need to access objects such as the config loader should now do so through `context` in the new `after_context_created` hook.
- `config_loader` is available as a public read-only attribute of `KedroContext`.
- Made `hook_manager` argument optional for `runner.run`.
- `kedro docs` now opens an online version of the Kedro documentation instead of a locally built version.
Upcoming deprecations for Kedro 0.19.0
- `kedro docs` will be removed in 0.19.0.
Published by idanov about 4 years ago
kedro - 0.18.0
Release 0.18.0
TL;DR ✨
Kedro 0.18.0 strives to reduce the complexity of the project template and get us closer to a stable release of the framework. We've introduced the full micro-packaging workflow 📦, which allows you to import packages, utility functions and existing pipelines into your Kedro project. Integration with IPython and Jupyter has been streamlined in preparation for enhancements to Kedro's interactive workflow. Additionally, the release comes with long-awaited Python 3.9 and 3.10 support 🐍.
Major features and improvements
Framework
- Added `kedro.config.abstract_config.AbstractConfigLoader` as an abstract base class for all `ConfigLoader` implementations. `ConfigLoader` and `TemplatedConfigLoader` now inherit directly from this base class.
- Streamlined the `ConfigLoader.get` and `TemplatedConfigLoader.get` API and delegated the actual `get` method functional implementation to the `kedro.config.common` module.
- The `hook_manager` is no longer a global singleton. The `hook_manager` lifecycle is now managed by the `KedroSession`, and a new `hook_manager` will be created every time a `session` is instantiated.
- Added support for specifying parameters mapping in `pipeline()` without the `params:` prefix.
- Added new API `Pipeline.filter()` (previously in `KedroContext._filter_pipeline()`) to filter parts of a pipeline.
- Added `username` to Session store for logging during Experiment Tracking.
- A packaged Kedro project can now be imported and run from another Python project as follows:

```python
from my_package.__main__ import main

main(
    ["--pipeline", "my_pipeline"]
)  # or just main() if no parameters are needed for the run
```
Project template
- Removed `cli.py` from the Kedro project template. By default, all CLI commands, including `kedro run`, are now defined on the Kedro framework side. You can still define custom CLI commands by creating your own `cli.py`.
- Removed `hooks.py` from the Kedro project template. Registration hooks have been removed in favour of `settings.py` configuration, but you can still define execution timeline hooks by creating your own `hooks.py`.
- Removed `.ipython` directory from the Kedro project template. The IPython/Jupyter workflow no longer uses IPython profiles; it now uses an IPython extension.
- The default `kedro` run configuration environment names can now be set in `settings.py` using the `CONFIG_LOADER_ARGS` variable. The relevant keyword arguments to supply are `base_env` and `default_run_env`, which are set to `base` and `local` respectively by default.
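As a rough sketch of the last point, an entry in `settings.py` might look like the fragment below. The values shown are simply the defaults named above; a project would replace them with its own environment names, and the exact file layout is an assumption.

```python
# src/<package_name>/settings.py (illustrative fragment)

# Keyword arguments handed to the configured config loader class.
CONFIG_LOADER_ARGS = {
    "base_env": "base",          # environment that holds shared configuration
    "default_run_env": "local",  # environment used when none is given for a run
}
```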
DataSets
- Added the following new datasets:
| Type | Description | Location |
| ------------------------- | ------------------------------------------------------------- | -------------------------------- |
| pandas.XMLDataSet | Read XML into Pandas DataFrame. Write Pandas DataFrame to XML | kedro.extras.datasets.pandas |
| networkx.GraphMLDataSet | Work with NetworkX using GraphML files | kedro.extras.datasets.networkx |
| networkx.GMLDataSet | Work with NetworkX using Graph Modelling Language files | kedro.extras.datasets.networkx |
| redis.PickleDataSet | loads/saves data from/to a Redis database | kedro.extras.datasets.redis |
- Added `partitionBy` support and exposed `save_args` for `SparkHiveDataSet`.
- Exposed `open_args_save` in `fs_args` for `pandas.ParquetDataSet`.
- Refactored the `load` and `save` operations for `pandas` datasets in order to leverage `pandas` own API and delegate `fsspec` operations to them. This reduces the need to have our own `fsspec` wrappers.
- Merged `pandas.AppendableExcelDataSet` into `pandas.ExcelDataSet`.
- Added `save_args` to `feather.FeatherDataSet`.
Jupyter and IPython integration
- The only recommended way to work with Kedro in Jupyter or IPython is now the Kedro IPython extension. Managed Jupyter instances should load this via `%load_ext kedro.extras.extensions.ipython` and use the line magic `%reload_kedro`.
- `kedro ipython` launches an IPython session that preloads the Kedro IPython extension.
- `kedro jupyter notebook/lab` creates a custom Jupyter kernel that preloads the Kedro IPython extension and launches a notebook with that kernel selected. There is no longer a need to specify `--all-kernels` to show all available kernels.
Dependencies
- Bumped the minimum version of `pandas` to 1.3. Any `storage_options` should continue to be specified under `fs_args` and/or `credentials`.
- Added support for Python 3.9 and 3.10, dropped support for Python 3.6.
- Updated `black` dependency in the project template to a non pre-release version.
Other
- Documented distribution of Kedro pipelines with Dask.
Breaking changes to the API
Framework
- Removed `RegistrationSpecs` and its associated `register_config_loader` and `register_catalog` hook specifications in favour of `CONFIG_LOADER_CLASS`/`CONFIG_LOADER_ARGS` and `DATA_CATALOG_CLASS` in `settings.py`.
- Removed deprecated functions `load_context` and `get_project_context`.
- Removed deprecated `CONF_SOURCE`, `package_name`, `pipeline`, `pipelines`, `config_loader` and `io` attributes from `KedroContext` as well as the deprecated `KedroContext.run` method.
- Added the `PluginManager` `hook_manager` argument to `KedroContext` and the `Runner.run()` method, which will be provided by the `KedroSession`.
- Removed the public method `get_hook_manager()` and replaced its functionality by `_create_hook_manager()`.
- Enforced that only one run can be successfully executed as part of a `KedroSession`. `run_id` has been renamed to `session_id` as a result.
Configuration loaders
- The `settings.py` setting `CONF_ROOT` has been renamed to `CONF_SOURCE`. Default value of `conf` remains unchanged.
- `ConfigLoader` and `TemplatedConfigLoader` argument `conf_root` has been renamed to `conf_source`.
- `extra_params` has been renamed to `runtime_params` in `kedro.config.config.ConfigLoader` and `kedro.config.templated_config.TemplatedConfigLoader`.
- The environment defaulting behaviour has been removed from `KedroContext` and is now implemented in a `ConfigLoader` class (or equivalent) with the `base_env` and `default_run_env` attributes.
DataSets
- `pandas.ExcelDataSet` now uses `openpyxl` engine instead of `xlrd`.
- `pandas.ParquetDataSet` now calls `pd.to_parquet()` upon saving. Note that the argument `partition_cols` is not supported.
- `spark.SparkHiveDataSet` API has been updated to reflect `spark.SparkDataSet`. The `write_mode=insert` option has also been replaced with `write_mode=append` as per Spark styleguide. This change addresses Issue 725 and Issue 745. Additionally, `upsert` mode now leverages `checkpoint` functionality and requires a valid `checkpointDir` be set for current `SparkContext`.
- `yaml.YAMLDataSet` can no longer save a `pandas.DataFrame` directly, but it can save a dictionary. Use `pandas.DataFrame.to_dict()` to convert your `pandas.DataFrame` to a dictionary before you attempt to save it to YAML.
- Removed `open_args_load` and `open_args_save` from the following datasets:
  - `pandas.CSVDataSet`
  - `pandas.ExcelDataSet`
  - `pandas.FeatherDataSet`
  - `pandas.JSONDataSet`
  - `pandas.ParquetDataSet`
- `storage_options` are now dropped if they are specified under `load_args` or `save_args` for the following datasets:
  - `pandas.CSVDataSet`
  - `pandas.ExcelDataSet`
  - `pandas.FeatherDataSet`
  - `pandas.JSONDataSet`
  - `pandas.ParquetDataSet`
- Renamed `lambda_data_set`, `memory_data_set`, and `partitioned_data_set` to `lambda_dataset`, `memory_dataset`, and `partitioned_dataset`, respectively, in `kedro.io`.
- The dataset `networkx.NetworkXDataSet` has been renamed to `networkx.JSONDataSet`.
CLI
- Removed `kedro install` in favour of `pip install -r src/requirements.txt` to install project dependencies.
- Removed `--parallel` flag from `kedro run` in favour of `--runner=ParallelRunner`. The `-p` flag is now an alias for `--pipeline`.
- `kedro pipeline package` has been replaced by `kedro micropkg package` and, in addition to the `--alias` flag used to rename the package, now accepts a module name and path to the pipeline or utility module to package, relative to `src/<package_name>/`. The `--version` CLI option has been removed in favour of setting a `__version__` variable in the micro-package's `__init__.py` file.
- `kedro pipeline pull` has been replaced by `kedro micropkg pull` and now also supports `--destination` to provide a location for pulling the package.
- Removed `kedro pipeline list` and `kedro pipeline describe` in favour of `kedro registry list` and `kedro registry describe`.
- `kedro package` and `kedro micropkg package` now save `egg` and `whl` or `tar` files in the `<project_root>/dist` folder (previously `<project_root>/src/dist`).
- Changed the behaviour of `kedro build-reqs` to compile requirements from `requirements.txt` instead of `requirements.in` and save them to `requirements.lock` instead of `requirements.txt`.
- `kedro jupyter notebook/lab` no longer accept `--all-kernels` or `--idle-timeout` flags. `--all-kernels` is now the default behaviour.
- `KedroSession.run` now raises `ValueError` rather than `KedroContextError` when the pipeline contains no nodes. The same `ValueError` is raised when there are no matching tags.
- `KedroSession.run` now raises `ValueError` rather than `KedroContextError` when the pipeline name doesn't exist in the pipeline registry.
Other
- Added namespace to parameters in a modular pipeline, which addresses Issue 399.
- Switched from packaging pipelines as wheel files to tar archive files compressed with gzip (`.tar.gz`).
- Removed decorator API from `Node` and `Pipeline`, as well as the modules `kedro.extras.decorators` and `kedro.pipeline.decorators`.
- Removed transformer API from `DataCatalog`, as well as the modules `kedro.extras.transformers` and `kedro.io.transformers`.
- Removed the `Journal` and `DataCatalogWithDefault`.
- Removed `%init_kedro` IPython line magic, with its functionality incorporated into `%reload_kedro`. This means that if `%reload_kedro` is called with a filepath, that will be set as default for subsequent calls.
Migration guide from Kedro 0.17.* to 0.18.*
Hooks
- Remove any existing `hook_impl` of the `register_config_loader` and `register_catalog` methods from `ProjectHooks` in `hooks.py` (or custom alternatives).
- If you use `run_id` in the `after_catalog_created` hook, replace it with `save_version` instead.
- If you use `run_id` in any of the `before_node_run`, `after_node_run`, `on_node_error`, `before_pipeline_run`, `after_pipeline_run` or `on_pipeline_error` hooks, replace it with `session_id` instead.
settings.py file
- If you use a custom config loader class such as `kedro.config.TemplatedConfigLoader`, alter `CONFIG_LOADER_CLASS` to specify the class and `CONFIG_LOADER_ARGS` to specify keyword arguments. If not set, these default to `kedro.config.ConfigLoader` and an empty dictionary respectively.
- If you use a custom data catalog class, alter `DATA_CATALOG_CLASS` to specify the class. If not set, this defaults to `kedro.io.DataCatalog`.
- If you have a custom config location (i.e. not `conf`), update `CONF_ROOT` to `CONF_SOURCE` and set it to a string with the expected configuration location. If not set, this defaults to `"conf"`.
Modular pipelines
- If you use any modular pipelines with parameters, make sure they are declared with the correct namespace. See example below:
For a given pipeline:
```python
active_pipeline = pipeline(
    pipe=[
        node(
            func=some_func,
            inputs=["model_input_table", "params:model_options"],
            outputs=["**my_output"],
        ),
        ...,
    ],
    inputs="model_input_table",
    namespace="candidate_modelling_pipeline",
)
```
The parameters should look like this:
```diff
-model_options:
-  test_size: 0.2
-  random_state: 8
-  features:
-  - engines
-  - passenger_capacity
-  - crew
+candidate_modelling_pipeline:
+  model_options:
+    test_size: 0.2
+    random_state: 8
+    features:
+    - engines
+    - passenger_capacity
+    - crew
```

* Optional: You can now remove all `params:` prefix when supplying values to `parameters` argument in a `pipeline()` call.
* If you pull modular pipelines with `kedro pipeline pull my_pipeline --alias other_pipeline`, now use `kedro micropkg pull my_pipeline --alias pipelines.other_pipeline` instead.
* If you package modular pipelines with `kedro pipeline package my_pipeline`, now use `kedro micropkg package pipelines.my_pipeline` instead.
* Similarly, if you package any modular pipelines using `pyproject.toml`, you should modify the keys to include the full module path, and wrapped in double-quotes, e.g:

```diff
[tool.kedro.micropkg.package]
-data_engineering = {destination = "path/to/here"}
-data_science = {alias = "ds", env = "local"}
+"pipelines.data_engineering" = {destination = "path/to/here"}
+"pipelines.data_science" = {alias = "ds", env = "local"}

[tool.kedro.micropkg.pull]
-"s3://my_bucket/my_pipeline" = {alias = "aliased_pipeline"}
+"s3://my_bucket/my_pipeline" = {alias = "pipelines.aliased_pipeline"}
```
DataSets
- If you use `pandas.ExcelDataSet`, make sure you have `openpyxl` installed in your environment. This is automatically installed if you specify `kedro[pandas.ExcelDataSet]==0.18.0` in your `requirements.txt`. You can uninstall `xlrd` if you were only using it for this dataset.
- If you use `pandas.ParquetDataSet`, pass pandas saving arguments directly to `save_args` instead of nested in `from_pandas` (e.g. `save_args = {"preserve_index": False}` instead of `save_args = {"from_pandas": {"preserve_index": False}}`).
- If you use `spark.SparkHiveDataSet` with `write_mode` option set to `insert`, change this to `append` in line with the Spark styleguide. If you use `spark.SparkHiveDataSet` with `write_mode` option set to `upsert`, make sure that your `SparkContext` has a valid `checkpointDir` set either by `SparkContext.setCheckpointDir` method or directly in the `conf` folder.
- If you use `pandas~=1.2.0` and pass `storage_options` through `load_args` or `save_args`, specify them under `fs_args` or via `credentials` instead.
- If you import from `kedro.io.lambda_data_set`, `kedro.io.memory_data_set`, or `kedro.io.partitioned_data_set`, change the import to `kedro.io.lambda_dataset`, `kedro.io.memory_dataset`, or `kedro.io.partitioned_dataset`, respectively (or import the dataset directly from `kedro.io`).
- If you have any `pandas.AppendableExcelDataSet` entries in your catalog, replace them with `pandas.ExcelDataSet`.
- If you have any `networkx.NetworkXDataSet` entries in your catalog, replace them with `networkx.JSONDataSet`.
Other
- Edit any scripts containing `kedro pipeline package --version` to use `kedro micropkg package` instead. If you wish to set a specific pipeline package version, set the `__version__` variable in the pipeline package's `__init__.py` file.
- To run a pipeline in parallel, use `kedro run --runner=ParallelRunner` rather than `--parallel` or `-p`.
- If you call `ConfigLoader` or `TemplatedConfigLoader` directly, update the keyword arguments `conf_root` to `conf_source` and `extra_params` to `runtime_params`.
- If you use `KedroContext` to access `ConfigLoader`, use `settings.CONFIG_LOADER_CLASS` to access the currently used `ConfigLoader` instead.
Published by idanov about 4 years ago
kedro - 0.17.7
Release 0.17.7
Major features and improvements
- `pipeline` now accepts `tags` and a collection of `Node`s and/or `Pipeline`s rather than just a single `Pipeline` object. `pipeline` should be used in preference to `Pipeline` when creating a Kedro pipeline.
- `pandas.SQLTableDataSet` and `pandas.SQLQueryDataSet` now only open one connection per database, at instantiation time (therefore at catalog creation time), rather than one per load/save operation.
- Added new command group, `micropkg`, to replace `kedro pipeline pull` and `kedro pipeline package` with `kedro micropkg pull` and `kedro micropkg package` for Kedro 0.18.0. `kedro micropkg package` saves packages to `project/dist` while `kedro pipeline package` saves packages to `project/src/dist`.
Bug fixes and other changes
- Added tutorial documentation for experiment tracking.
- Added Plotly dataset documentation.
- Added the upper limit `pandas<1.4` to maintain compatibility with `xlrd~=1.0`.
- Bumped the `Pillow` minimum version requirement to 9.0 (Python 3.7+ only) following CVE-2022-22817.
- Fixed `PickleDataSet` to be copyable and hence work with the parallel runner.
- Upgraded `pip-tools`, which is used by `kedro build-reqs`, to 6.5 (Python 3.7+ only). This `pip-tools` version is compatible with `pip>=21.2`, including the most recent releases of `pip`. Python 3.6 users should continue to use `pip-tools` 6.4 and `pip<22`.
- Added `astro-iris` as alias for `astro-airflow-iris`, so that old tutorials can still be followed.
- Added details about Kedro's Technical Steering Committee and governance model.
Upcoming deprecations for Kedro 0.18.0
- `kedro pipeline pull` and `kedro pipeline package` will be deprecated. Please use `kedro micropkg` instead.
- Python
Published by idanov over 4 years ago
kedro - 0.17.6
Release 0.17.6
Major features and improvements
- Added `pipelines` global variable to IPython extension, allowing you to access the project's pipelines in `kedro ipython` or `kedro jupyter notebook`.
- Enabled overriding nested parameters with `params` in CLI, i.e. `kedro run --params="model.model_tuning.booster:gbtree"` updates parameters to `{"model": {"model_tuning": {"booster": "gbtree"}}}`.
- Added option to `pandas.SQLQueryDataSet` to specify a `filepath` with a SQL query, in addition to the current method of supplying the query itself in the `sql` argument.
- Extended `ExcelDataSet` to support saving Excel files with multiple sheets.
- Added the following new datasets:
| Type | Description | Location |
| --------------------------- | ---------------------------------------------------- | --------------------------------- |
| plotly.JSONDataSet | Works with plotly graph object Figures (saves as json file) | kedro.extras.datasets.plotly |
| pandas.GenericDataSet | Provides a 'best effort' facility to read / write any format provided by the pandas library | kedro.extras.datasets.pandas |
| pandas.GBQQueryDataSet | Loads data from a Google Bigquery table using provided SQL query | kedro.extras.datasets.pandas |
| spark.DeltaTableDataSet | Dataset designed to handle Delta Lake Tables and their CRUD-style operations, including update, merge and delete | kedro.extras.datasets.spark |
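
The nested-parameter override described above (a dotted key on the command line becoming a nested dictionary) can be pictured with a small standalone sketch. This is not Kedro's implementation: `expand_dotted_params` and `parse_params_option` are hypothetical helper names, and the parsing is deliberately simplified (values stay strings, no quoting rules).

```python
from typing import Any, Dict


def expand_dotted_params(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand keys such as "a.b.c" into nested dictionaries (illustrative helper)."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        cursor = nested
        for part in parents:
            # Create intermediate levels on demand and descend into them.
            cursor = cursor.setdefault(part, {})
        cursor[leaf] = value
    return nested


def parse_params_option(raw: str) -> Dict[str, Any]:
    """Parse "key:value,key2:value2" into a nested dict (simplified assumption)."""
    flat: Dict[str, Any] = {}
    for item in raw.split(","):
        key, _, value = item.partition(":")
        flat[key.strip()] = value.strip()
    return expand_dotted_params(flat)


print(parse_params_option("model.model_tuning.booster:gbtree"))
# → {'model': {'model_tuning': {'booster': 'gbtree'}}}
```

Keys that share a prefix end up under the same parent dictionary, which is what makes overriding one leaf of a nested parameter possible without restating its siblings.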
Bug fixes and other changes
- Fixed an issue where `kedro new --config config.yml` was ignoring the config file when `prompts.yml` didn't exist.
- Added documentation for `kedro viz --autoreload`.
- Added support for arbitrary backends (via importable module paths) that satisfy the `pickle` interface to `PickleDataSet`.
- Added support for `sum` syntax for connecting pipeline objects.
- Upgraded `pip-tools`, which is used by `kedro build-reqs`, to 6.4. This `pip-tools` version requires `pip>=21.2` while adding support for `pip>=21.3`. To upgrade `pip`, please refer to their documentation.
- Relaxed the bounds on the `plotly` requirement for `plotly.PlotlyDataSet` and the `pyarrow` requirement for `pandas.ParquetDataSet`.
- `kedro pipeline package <pipeline>` now raises an error if the `<pipeline>` argument doesn't look like a valid Python module path (e.g. has `/` instead of `.`).
- Added new `overwrite` argument to `PartitionedDataSet` and `MatplotlibWriter` to enable deletion of existing partitions and plots on dataset `save`.
- `kedro pipeline pull` now works when the project requirements contain entries such as `-r`, `--extra-index-url` and local wheel files (Issue #913).
- Fixed slow startup because of catalog processing by reducing the exponential growth of extra processing during `_FrozenDatasets` creations.
- Removed `.coveragerc` from the Kedro project template. `coverage` settings are now given in `pyproject.toml`.
- Fixed a bug where packaging or pulling a modular pipeline with the same name as the project's package name would throw an error (or silently pass without including the pipeline source code in the wheel file).
- Removed unintentional dependency on `git`.
- Fixed an issue where nested pipeline configuration was not included in the packaged pipeline.
- Deprecated the "Thanks for supporting contributions" section of release notes to simplify the contribution process; Kedro 0.17.6 is the last release that includes this. This process has been replaced with the automatic GitHub feature.
- Fixed a bug where the version on the tracking datasets didn't match the session id and the versions of regular versioned datasets.
- Fixed an issue where datasets in `load_versions` that are not found in the data catalog would silently pass.
- Altered the string representation of nodes so that node inputs/outputs order is preserved rather than being alphabetically sorted.
Upcoming deprecations for Kedro 0.18.0
- `kedro.extras.decorators` and `kedro.pipeline.decorators` are being deprecated in favour of Hooks.
- `kedro.extras.transformers` and `kedro.io.transformers` are being deprecated in favour of Hooks.
- The `--parallel` flag on `kedro run` is being removed in favour of `--runner=ParallelRunner`. The `-p` flag will change to be an alias for `--pipeline`.
- `kedro.io.DataCatalogWithDefault` is being deprecated, to be removed entirely in 0.18.0.
Thanks for supporting contributions
Deepyaman Datta, Brites, Manish Swami, Avaneesh Yembadi, Zain Patel, Simon Brugman, Kiyo Kunii, Benjamin Levy, Louis de Charsonville, Simon Picard
- Python
Published by idanov over 4 years ago
kedro - 0.17.5
Release 0.17.5
Major features and improvements
- Added new CLI group `registry`, with the associated commands `kedro registry list` and `kedro registry describe`, to replace `kedro pipeline list` and `kedro pipeline describe`.
- Added support for dependency management at a modular pipeline level. When a pipeline with `requirements.txt` is packaged, its dependencies are embedded in the modular pipeline wheel file. Upon pulling the pipeline, Kedro will append dependencies to the project's `requirements.in`. More information is available in our documentation.
- Added support for bulk packaging/pulling modular pipelines using `kedro pipeline package/pull --all` and `pyproject.toml`.
- Removed `cli.py` from the Kedro project template. By default all CLI commands, including `kedro run`, are now defined on the Kedro framework side. These can be overridden in turn by a plugin or a `cli.py` file in your project. A packaged Kedro project will respect the same hierarchy when executed with `python -m my_package`.
- Removed `.ipython/profile_default/startup/` from the Kedro project template in favour of `.ipython/profile_default/ipython_config.py` and the `kedro.extras.extensions.ipython`.
- Added support for `dill` backend to `PickleDataSet`.
- Imports are now refactored at `kedro pipeline package` and `kedro pipeline pull` time, so that aliasing a modular pipeline doesn't break it.
- Added the following new datasets to support basic Experiment Tracking:
| Type | Description | Location |
| --------------------------- | ---------------------------------------------------- | --------------------------------- |
| tracking.MetricsDataSet | Dataset to track numeric metrics for experiment tracking | kedro.extras.datasets.tracking |
| tracking.JSONDataSet | Dataset to track data for experiment tracking | kedro.extras.datasets.tracking |
Bug fixes and other changes
- Bumped minimum required `fsspec` version to 2021.04.
- Fixed the `kedro install` and `kedro build-reqs` flows when uninstalled dependencies are present in a project's `settings.py`, `context.py` or `hooks.py` (Issue #829).
- Pinned `dynaconf` to `<3.1.6` because the method signature for `_validate_items`, which is used in Kedro, changed.
Minor breaking changes to the API
Upcoming deprecations for Kedro 0.18.0
- `kedro pipeline list` and `kedro pipeline describe` are being deprecated in favour of new commands `kedro registry list` and `kedro registry describe`.
- `kedro install` is being deprecated in favour of using `pip install -r src/requirements.txt` to install project dependencies.
Thanks for supporting contributions
- Python
Published by idanov over 4 years ago
kedro - 0.17.4
Release 0.17.4
Major features and improvements
- Added the following new datasets:
| Type | Description | Location |
| --------------------------- | ---------------------------------------------------- | --------------------------------- |
| plotly.PlotlyDataSet | Works with plotly graph object Figures (saves as json file) | kedro.extras.datasets.plotly |
Bug fixes and other changes
- Defined our set of Kedro Principles! Have a read through our docs.
- `ConfigLoader.get()` now raises a `BadConfigException`, with a more helpful error message, if a configuration file cannot be loaded (for instance due to wrong syntax or poor formatting).
- `run_id` now defaults to `save_version` when `after_catalog_created` is called, similarly to what happens during a `kedro run`.
- Fixed a bug where `kedro ipython` and `kedro jupyter notebook` didn't work if the `PYTHONPATH` was already set.
- Updated the IPython extension to allow passing `env` and `extra_params` to `reload_kedro` similar to how the IPython script works.
- `kedro info` now outputs if a plugin has any `hooks` or `cli_hooks` implemented.
- `PartitionedDataSet` now supports lazily materializing data on save.
- `kedro pipeline describe` now defaults to the `__default__` pipeline when no pipeline name is provided and also shows the namespace the nodes belong to.
- Fixed an issue where `spark.SparkDataSet` with enabled versioning would throw a `VersionNotFoundError` when using `databricks-connect` from a remote machine and saving to the `dbfs` filesystem.
- `EmailMessageDataSet` added to doctree.
- When node inputs do not pass validation, the error message is now shown as the most recent exception in the traceback (Issue #761).
- `kedro pipeline package` now only packages the parameter file that exactly matches the pipeline name specified and the parameter files in a directory with the pipeline name.
- Extended support to newer versions of third-party dependencies (Issue #735).
- Ensured consistent references to `model input` tables in accordance with our Data Engineering convention.
- Changed behaviour where `kedro pipeline package` takes the pipeline package version, rather than the kedro package version. If the pipeline package version is not present, then the package version is used.
- Launched GitHub Discussions and Kedro Discord Server.
- Improved error message when versioning is enabled for a dataset previously saved as non-versioned (Issue #625).
- Python
Published by idanov almost 5 years ago
kedro - 0.17.3
Release 0.17.3
Major features and improvements
- Kedro plugins can now override built-in CLI commands.
- Added a `before_command_run` hook for plugins to add extra behaviour before Kedro CLI commands run.
- `pipelines` from `pipeline_registry.py` and `register_pipeline` hooks are now loaded lazily when they are first accessed, not on startup:

```python
from kedro.framework.project import pipelines

print(pipelines["__default__"])  # pipeline loading is only triggered here
```
Bug fixes and other changes
- `TemplatedConfigLoader` now correctly inserts default values when no globals are supplied.
- Fixed a bug where the `KEDRO_ENV` environment variable had no effect on instantiating the `context` variable in an IPython session or a Jupyter notebook.
- Plugins with empty CLI groups are no longer displayed in the Kedro CLI help screen.
- Duplicate commands will no longer appear twice in the Kedro CLI help screen.
- CLI commands from sources with the same name will show under one list in the help screen.
- The setup of a Kedro project, including adding src to path and configuring settings, is now handled via the `bootstrap_project` method.
- `configure_project` is invoked if a `package_name` is supplied to `KedroSession.create`. This is added for backward compatibility, to support a workflow that creates `Session` manually. It will be removed in `0.18.0`.
- Stopped swallowing up all `ModuleNotFoundError` if `register_pipelines` is not found, so that a more helpful error message will appear when a dependency is missing, e.g. Issue #722.
- When `kedro new` is invoked using a configuration yaml file, `output_dir` is no longer a required key; by default the current working directory will be used.
- When `kedro new` is invoked using a configuration yaml file, the appropriate `prompts.yml` file is now used for validating the provided configuration. Previously, validation was always performed against the kedro project template `prompts.yml` file.
- When a relative path to a starter template is provided, `kedro new` now generates user prompts to obtain configuration rather than supplying empty configuration.
- Fixed error when using starters on Windows with Python 3.7 (Issue #722).
- Fixed decoding error of config files that contain accented characters by opening them for reading in UTF-8.
- Fixed an issue where `after_dataset_loaded` run would finish before a dataset is actually loaded when using `--async` flag.
Upcoming deprecations for Kedro 0.18.0
- `kedro.versioning.journal.Journal` will be removed.
- The following properties on `kedro.framework.context.KedroContext` will be removed:
  - `io` in favour of `KedroContext.catalog`
  - `pipeline` (equivalent to `pipelines["__default__"]`)
  - `pipelines` in favour of `kedro.framework.project.pipelines`
- Python
Published by idanov about 5 years ago
kedro - 0.17.2
Release 0.17.2
Major features and improvements
- Added support for `compress_pickle` backend to `PickleDataSet`.
- Enabled loading pipelines without creating a `KedroContext` instance:

```python
from kedro.framework.project import pipelines

print(pipelines)
```

- Projects generated with kedro>=0.17.2:
  - should define pipelines in `pipeline_registry.py` rather than `hooks.py`.
  - when run as a package, will behave the same as `kedro run`.
Bug fixes and other changes
- If `settings.py` is not importable, the errors will be surfaced earlier in the process, rather than at runtime.
Minor breaking changes to the API
- `kedro pipeline list` and `kedro pipeline describe` no longer accept redundant `--env` parameter.
- `from kedro.framework.cli.cli import cli` no longer includes the `new` and `starter` commands.
Upcoming deprecations for Kedro 0.18.0
- `kedro.framework.context.KedroContext.run` will be removed in release 0.18.0.
Thanks for supporting contributions
- Python
Published by idanov about 5 years ago
kedro - 0.17.1
Release 0.17.1
Major features and improvements
- Added `env` and `extra_params` to `reload_kedro()` line magic.
- Extended the `pipeline()` API to allow strings and sets of strings as `inputs` and `outputs`, to specify when a dataset name remains the same (not namespaced).
- Added the ability to add custom prompts with regexp validator for starters by repurposing `default_config.yml` as `prompts.yml`.
- Added the `env` and `extra_params` arguments to `register_config_loader` hook.
- Refactored the way `settings` are loaded. You will now be able to run:

```python
from kedro.framework.project import settings

print(settings.CONF_ROOT)
```
Bug fixes and other changes
- The version of a packaged modular pipeline now defaults to the version of the project package.
- Added fix to prevent new lines being added to pandas CSV datasets.
- Fixed issue with loading a versioned `SparkDataSet` in the interactive workflow.
- Kedro CLI now checks `pyproject.toml` for a `tool.kedro` section before treating the project as a Kedro project.
- Fixed `DataCatalog::shallow_copy` so that it now copies layers.
- `kedro pipeline pull` now uses `pip download` for protocols that are not supported by `fsspec`.
- Cleaned up documentation to fix broken links and rewrite permanently redirected ones.
- Added a `jsonschema` schema definition for the Kedro 0.17 catalog.
- `kedro install` now waits on Windows until all the requirements are installed.
- Exposed `--to-outputs` option in the CLI, throughout the codebase, and as part of hooks specifications.
- Fixed a bug where `ParquetDataSet` wasn't creating parent directories on the fly.
- Updated documentation.
Breaking changes to the API
- This release has broken the `kedro ipython` and `kedro jupyter` workflows. To fix this, follow the instructions in the migration guide below.

Note: If you're using the `ipython` extension instead, you will not encounter this problem.
Migration guide
You will have to update the file `<your_project>/.ipython/profile_default/startup/00-kedro-init.py` in order to make `kedro ipython` and/or `kedro jupyter` work. Add the following line before the `KedroSession` is created:

```python
configure_project(metadata.package_name)  # to add

session = KedroSession.create(metadata.package_name, path)
```

Make sure that the associated import is provided in the same place as others in the file:

```python
from kedro.framework.project import configure_project  # to add
from kedro.framework.session import KedroSession
```
Thanks for supporting contributions
Mariana Silva, Kiyohito Kunii, noklam, Ivan Doroshenko, Zain Patel, Deepyaman Datta, Sam Hiscox, Pascal Brokmeier
- Python
Published by idanov about 5 years ago
kedro - 0.17.0
Release 0.17.0
Major features and improvements
- In a significant change, we have introduced `KedroSession` which is responsible for managing the lifecycle of a Kedro run.
- Created a new Kedro Starter: `kedro new --starter=mini-kedro`. It is possible to use the DataCatalog as a standalone component in a Jupyter notebook and transition into the rest of the Kedro framework.
- Added `DatasetSpecs` with Hooks to run before and after datasets are loaded from/saved to the catalog.
- Added a command: `kedro catalog create`. For a registered pipeline, it creates a `<conf_root>/<env>/catalog/<pipeline_name>.yml` configuration file with `MemoryDataSet` datasets for each dataset that is missing from `DataCatalog`.
- Added `settings.py` and `pyproject.toml` (to replace `.kedro.yml`) for project configuration, in line with Python best practice.
- `ProjectContext` is no longer needed, unless for very complex customisations. `KedroContext`, `ProjectHooks` and `settings.py` together implement sensible default behaviour. As a result `context_path` is also now an optional key in `pyproject.toml`.
- Removed `ProjectContext` from `src/<package_name>/run.py`.
- `TemplatedConfigLoader` now supports Jinja2 template syntax alongside its original syntax.
- Made registration Hooks mandatory, as the only way to customise the `ConfigLoader` or the `DataCatalog` used in a project. If no such Hook is provided in `src/<package_name>/hooks.py`, a `KedroContextError` is raised. There are sensible defaults defined in any project generated with Kedro >= 0.16.5.
Bug fixes and other changes
- `ParallelRunner` no longer results in a run failure, when triggered from a notebook, if the run is started using `KedroSession` (`session.run()`).
- `before_node_run` can now overwrite node inputs by returning a dictionary with the corresponding updates.
- Added minimal, black-compatible flake8 configuration to the project template.
- Moved `isort` and `pytest` configuration from `<project_root>/setup.cfg` to `<project_root>/pyproject.toml`.
- Extra parameters are no longer incorrectly passed from `KedroSession` to `KedroContext`.
- Relaxed `pyspark` requirements to allow for installation of `pyspark` 3.0.
- Added a `--fs-args` option to the `kedro pipeline pull` command to specify configuration options for the `fsspec` filesystem arguments used when pulling modular pipelines from non-PyPI locations.
- Bumped maximum required `fsspec` version to 0.9.
- Bumped maximum supported `s3fs` version to 0.5 (`S3FileSystem` interface has changed since 0.4.1 version).
Deprecations
- In Kedro 0.17.0 we have deleted the deprecated `kedro.cli` and `kedro.context` modules in favour of `kedro.framework.cli` and `kedro.framework.context` respectively.
Other breaking changes to the API
- `kedro.io.DataCatalog.exists()` returns `False` when the dataset does not exist, as opposed to raising an exception.
- The pipeline-specific `catalog.yml` file is no longer automatically created for modular pipelines when running `kedro pipeline create`. Use `kedro catalog create` to replace this functionality.
- Removed `include_examples` prompt from `kedro new`. To generate boilerplate example code, you should use a Kedro starter.
- Changed the `--verbose` flag from a global command to a project-specific command flag (e.g. `kedro --verbose new` becomes `kedro new --verbose`).
- Dropped support of the `dataset_credentials` key in credentials in `PartitionedDataSet`.
- `get_source_dir()` was removed from `kedro/framework/cli/utils.py`.
- Dropped support of `get_config`, `create_catalog`, `create_pipeline`, `template_version`, `project_name` and `project_path` keys by `get_project_context()` function (`kedro/framework/cli/cli.py`).
- `kedro new --starter` now defaults to fetching the starter template matching the installed Kedro version.
- Renamed `kedro_cli.py` to `cli.py` and moved it inside the Python package (`src/<package_name>/`), for a better packaging and deployment experience.
- Removed `.kedro.yml` from the project template and replaced it with `pyproject.toml`.
- Removed `KEDRO_CONFIGS` constant (previously residing in `kedro.framework.context.context`).
- Modified `kedro pipeline create` CLI command to add a boilerplate parameter config file in `conf/<env>/parameters/<pipeline_name>.yml` instead of `conf/<env>/pipelines/<pipeline_name>/parameters.yml`. CLI commands `kedro pipeline delete` / `package` / `pull` were updated accordingly.
- Removed `get_static_project_data` from `kedro.framework.context`.
- Removed `KedroContext.static_data`.
- The `KedroContext` constructor now takes `package_name` as first argument.
- Replaced `context` property on `KedroSession` with `load_context()` method.
- Renamed `_push_session` and `_pop_session` in `kedro.framework.session.session` to `_activate_session` and `_deactivate_session` respectively.
- Custom context class is set via `CONTEXT_CLASS` variable in `src/<your_project>/settings.py`.
- Removed `KedroContext.hooks` attribute. Instead, hooks should be registered in `src/<your_project>/settings.py` under the `HOOKS` key.
- Restricted names given to nodes to match the regex pattern `[\w\.-]+$`.
- Removed `KedroContext._create_config_loader()` and `KedroContext._create_data_catalog()`. They have been replaced by registration hooks, namely `register_config_loader()` and `register_catalog()` (see also upcoming deprecations).
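
To check existing node names against the restriction listed above before upgrading, a standalone sketch such as the following may help. It only reuses the regex pattern quoted in these notes; `is_valid_node_name` is a hypothetical helper rather than a Kedro function, and applying the pattern with `re.match` is an assumption about how it is meant to be used.

```python
import re

# Pattern quoted in the release notes. re.match anchors it at the start of the
# string, and the trailing "$" anchors it at the end.
NODE_NAME_PATTERN = re.compile(r"[\w\.-]+$")


def is_valid_node_name(name: str) -> bool:
    """Return True if the name uses only word characters, dots and hyphens."""
    return NODE_NAME_PATTERN.match(name) is not None


for candidate in ["split_data.v2-final", "split data", "clean/raw"]:
    status = "ok" if is_valid_node_name(candidate) else "needs renaming"
    print(f"{candidate!r}: {status}")
```

Names containing spaces, slashes or other punctuation fail the check and would need renaming.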
Upcoming deprecations for Kedro 0.18.0
- `kedro.framework.context.load_context` will be removed in release 0.18.0.
- `kedro.framework.cli.get_project_context` will be removed in release 0.18.0.
- We've added a `DeprecationWarning` to the decorator API for both `node` and `pipeline`. These will be removed in release 0.18.0. Use Hooks to extend a node's behaviour instead.
- We've added a `DeprecationWarning` to the Transformers API when adding a transformer to the catalog. These will be removed in release 0.18.0. Use Hooks to customise the `load` and `save` methods.
Thanks for supporting contributions
Deepyaman Datta, Zach Schuster
Migration guide from Kedro 0.16.* to 0.17.*
Reminder: Our documentation on how to upgrade Kedro covers a few key things to remember when updating any Kedro version.
The Kedro 0.17.0 release contains some breaking changes. If you update Kedro to 0.17.0 and then try to work with projects created against earlier versions of Kedro, you may encounter some issues when trying to run kedro commands in the terminal for that project. Here's a short guide to getting your projects running against the new version of Kedro.
Note: As always, if you hit any problems, please check out our documentation:

- How can I find out more about Kedro?
- How can I get my questions answered?
To get an existing Kedro project to work after you upgrade to Kedro 0.17.0, we recommend that you create a new project against Kedro 0.17.0 and move the code from your existing project into it. Let's go through the changes, but first, note that if you create a new Kedro project with Kedro 0.17.0 you will not be asked whether you want to include the boilerplate code for the Iris dataset example. We've removed this option (you should now use a Kedro starter if you want to create a project that is pre-populated with code).
To create a new, blank Kedro 0.17.0 project to drop your existing code into, you can create one, as always, with kedro new. We also recommend creating a new virtual environment for your new project, or you might run into conflicts with existing dependencies.
- Update `pyproject.toml`: Copy the following three keys from the `.kedro.yml` of your existing Kedro project into the `pyproject.toml` file of your new Kedro 0.17.0 project:

```toml
[tool.kedro]
package_name = "<package_name>"
project_name = "<project_name>"
project_version = "0.17.0"
```

- Check your source directory. If you defined a different source directory (`source_dir`), make sure you also move that to `pyproject.toml`.
- Copy files from your existing project:
  - Copy subfolders of `project/src/project_name/pipelines` from existing to new project.
  - Copy subfolders of `project/src/test/pipelines` from existing to new project.
  - Copy the requirements your project needs into `requirements.txt` and/or `requirements.in`.
  - Copy your project configuration from the `conf` folder. Take note of the new locations needed for modular pipeline configuration (move it from `conf/<env>/pipeline_name/catalog.yml` to `conf/<env>/catalog/pipeline_name.yml` and likewise for `parameters.yml`).
  - Copy from the `data/` folder of your existing project, if needed, into the same location in your new project.
  - Copy any Hooks from `src/<package_name>/hooks.py`.
- Update your new project's README and docs as necessary.
- Update `settings.py`: For example, if you specified additional Hook implementations in `hooks`, or listed plugins under `disable_hooks_by_plugin` in your `.kedro.yml`, you will need to move them to `settings.py` accordingly:

```python
from <package_name>.hooks import MyCustomHooks, ProjectHooks

HOOKS = (ProjectHooks(), MyCustomHooks())

DISABLE_HOOKS_FOR_PLUGINS = ("my_plugin1",)
```
- Migration for `node` names. From 0.17.0 the only allowed characters for node names are letters, digits, hyphens, underscores and/or full stops. If you have previously defined node names that have special characters, spaces or other characters that are no longer permitted, you will need to rename those nodes.
- Copy changes to `kedro_cli.py`. If you previously customised the `kedro run` command or added more CLI commands to your `kedro_cli.py`, you should move them into `<project_root>/src/<package_name>/cli.py`. Note, however, that the new way to run a Kedro pipeline is via a `KedroSession`, rather than using the `KedroContext`:

```python
from kedro.framework.session import KedroSession

with KedroSession.create(package_name=...) as session:
    session.run()
```

- Copy changes made to `ConfigLoader`. If you have defined a custom class, such as `TemplatedConfigLoader`, by overriding `ProjectContext._create_config_loader`, you should move the contents of the function into `src/<package_name>/hooks.py`, under `register_config_loader`.
- Copy changes made to `DataCatalog`. Likewise, if you have `DataCatalog` defined with `ProjectContext._create_catalog`, you should copy-paste the contents into `register_catalog`.
- Optional: If you have plugins such as Kedro-Viz installed, it's likely that Kedro 0.17.0 won't work with their older versions, so please either upgrade to the plugin's newest version or follow their migration guides.
- Python
Published by idanov over 5 years ago
kedro - 0.16.6
Major features and improvements
- Added documentation with a focus on single machine and distributed environment deployment; the series includes Docker, Argo, Prefect, Kubeflow, AWS Batch, AWS Sagemaker and extends our section on Databricks
- Added `kedro-starter-spaceflights` alias for generating a project: `kedro new --starter spaceflights`.
Bug fixes and other changes
- Fixed `TypeError` when converting dict inputs to a node made from a wrapped `partial` function.
- `PartitionedDataSet` improvements:
  - Supported passing arguments to the underlying filesystem.
- Improved handling of non-ASCII word characters in dataset names.
  - For example, a dataset named `jalapeño` will be accessible as `DataCatalog.datasets.jalapeño` rather than `DataCatalog.datasets.jalape__o`.
- Fixed `kedro install` for an Anaconda environment defined in `environment.yml`.
- Fixed backwards compatibility with templates generated with older Kedro versions <0.16.5. No longer need to update `.kedro.yml` to use `kedro lint` and `kedro jupyter notebook convert`.
- Improved documentation.
- Added documentation using MinIO with Kedro.
- Improved error messages for incorrect parameters passed into a node.
- Fixed issue with saving a `TensorFlowModelDataset` in the HDF5 format with versioning enabled.
- Added missing `run_result` argument in `after_pipeline_run` Hooks spec.
- Fixed a bug in IPython script that was causing context hooks to be registered twice. To apply this fix to a project generated with an older Kedro version, apply the same changes made in this PR to your `00-kedro-init.py` file.
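
The `jalapeño` example above comes down to which characters count as "word" characters when a dataset name is turned into an attribute name. The sketch below contrasts an ASCII-only substitution with a Unicode-aware one. Both helper names and both patterns are assumptions made for illustration, not Kedro's actual code.

```python
import re


def attribute_name_ascii(dataset_name: str) -> str:
    """Older-style behaviour (assumed): anything outside [0-9a-zA-Z_] becomes "__"."""
    return re.sub(r"[^0-9a-zA-Z_]+", "__", dataset_name)


def attribute_name_unicode(dataset_name: str) -> str:
    """Newer-style behaviour (assumed): only non-word characters become "__".

    For str patterns in Python 3, \\W is Unicode-aware, so letters such as
    "\u00f1" are kept.
    """
    return re.sub(r"\W+", "__", dataset_name)


name = "jalape\u00f1o"  # "jalapeño", written with an escape to avoid encoding issues
print(attribute_name_ascii(name))              # → jalape__o
print(attribute_name_unicode(name))            # → jalapeño
print(attribute_name_unicode("cars@csv"))      # → cars__csv
```

Separators such as `@` are still replaced in both variants, so only the treatment of non-ASCII letters differs.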
Thanks for supporting contributions
Deepyaman Datta, Bhavya Merchant, Lovkush Agarwal, Varun Krishna S, Sebastian Bertoli, noklam, Daniel Petti, Waylon Walker
- Python
Published by idanov over 5 years ago
kedro - 0.16.5
Major features and improvements
- Added the following new datasets.
| Type | Description | Location |
| --------------------------- | ------------------------------------------------------------------------------------------------------- | ----------------------------- |
| email.EmailMessageDataSet | Manage email messages using the Python standard library | kedro.extras.datasets.email |
- Added support for
pyproject.tomlto configure Kedro.pyproject.tomlis used if.kedro.ymldoesn't exist (Kedro configuration should be under[tool.kedro]section). - Projects created with this version will have no
pipeline.py, having been replaced byhooks.py. - Added a set of registration hooks, as the new way of registering library components with a Kedro project:
register_pipelines(), to replace_get_pipelines()register_config_loader(), to replace_create_config_loader()register_catalog(), to replace_create_catalog()These can be defined insrc/<package-name>/hooks.pyand added to.kedro.yml(orpyproject.toml). The order of execution is: plugin hooks,.kedro.ymlhooks, hooks inProjectContext.hooks.
- Added ability to disable auto-registered Hooks using
.kedro.yml(orpyproject.toml) configuration file.
Bug fixes and other changes
- Added option to run asynchronously via the Kedro CLI.
- Absorbed `.isort.cfg` settings into `setup.cfg`.
- `project_name`, `project_version` and `package_name` now have to be defined in `.kedro.yml` for projects generated using Kedro 0.16.5+.
- Packaging a modular pipeline raises an error if the pipeline directory is empty or non-existent.
Thanks for supporting contributions
Deepyaman Datta, Bas Nijholt, Sebastian Bertoli
- Python
Published by idanov over 5 years ago
kedro - 0.16.4
Release 0.16.4
Major features and improvements
- Enabled auto-discovery of hooks implementations coming from installed plugins.
Bug fixes and other changes
- Fixed a bug for using `ParallelRunner` on Windows.
- Modified `GBQTableDataSet` to load customised results using customised queries from Google Big Query tables.
- Documentation improvements.
Thanks for supporting contributions
Ajay Bisht, Vijay Sajjanar, Deepyaman Datta, Sebastian Bertoli, Shahil Mawjee, Louis Guitton, Emanuel Ferm
- Python
Published by idanov almost 6 years ago
kedro - 0.16.2
Major features and improvements
- Added the following new datasets.
| Type | Description | Location |
| ----------------------------------- | --------------------------------------------------------------------------------------------------------------------- | ---------------------------------- |
| pandas.AppendableExcelDataSet | Works with Excel file opened in append mode | kedro.extras.datasets.pandas |
| tensorflow.TensorFlowModelDataset | Works with TensorFlow models using TensorFlow 2.X | kedro.extras.datasets.tensorflow |
| holoviews.HoloviewsWriter | Works with Holoviews objects (saves as image file) | kedro.extras.datasets.holoviews |
- `kedro install` will now compile project dependencies (by running `kedro build-reqs` behind the scenes) before the installation if the `src/requirements.in` file doesn't exist.
- Added `only_nodes_with_namespace` in `Pipeline` class to filter only nodes with a specified namespace.
- Added the `kedro pipeline delete` command to help delete unwanted or unused pipelines (it won't remove references to the pipeline in your `create_pipelines()` code).
- Added the `kedro pipeline package` command to help package up a modular pipeline. It will bundle up the pipeline source code, tests, and parameters configuration into a .whl file.
Bug fixes and other changes
- Improvement in `DataCatalog`:
  - Introduced regex filtering to the `DataCatalog.list()` method.
  - Non-alphanumeric characters (except underscore) in dataset name are replaced with `__` in `DataCatalog.datasets`, for ease of access to transcoded datasets.
- Improvement in Datasets:
  - Improved initialization speed of `spark.SparkHiveDataSet`.
  - Improved S3 cache in `spark.SparkDataSet`.
  - Added support of options for building `pyarrow` table in `pandas.ParquetDataSet`.
- Improvement in `kedro build-reqs` CLI command:
  - `kedro build-reqs` is now called with `-q` option and will no longer print out compiled requirements to the console for security reasons.
  - All unrecognized CLI options in `kedro build-reqs` command are now passed to pip-compile call (e.g. `kedro build-reqs --generate-hashes`).
- Improvement in `kedro jupyter` CLI command:
  - Improved error message when running `kedro jupyter notebook`, `kedro jupyter lab` or `kedro ipython` with Jupyter/IPython dependencies not being installed.
  - Fixed `%run_viz` line magic for showing kedro viz inside a Jupyter notebook. For the fix to be applied on existing Kedro project, please see the migration guide.
  - Fixed the bug in IPython startup script (issue 298).
- Documentation improvements:
  - Updated community-generated content in FAQ.
  - Added find-kedro and kedro-static-viz to the list of community plugins.
  - Added missing `pillow.ImageDataSet` entry to the documentation.
Breaking changes to the API
Migration guide from Kedro 0.16.1 to 0.16.2
Guide to apply the fix for %run_viz line magic in existing project
Even though this release ships a fix for projects generated with `kedro==0.16.2`, after upgrading you still need to change any existing project that was generated with `kedro>=0.16.0,<=0.16.1` for the fix to take effect. Specifically, replace the content of your project's IPython init script, located at `.ipython/profile_default/startup/00-kedro-init.py`, with the content of this file. You will also need `kedro-viz>=3.3.1`.
Thanks for supporting contributions
Miguel Rodriguez Gutierrez, Joel Schwarzmann, w0rdsm1th, Deepyaman Datta, Tam-Sanh Nguyen, Marcus Gawronsky
Published by idanov almost 6 years ago
kedro - 0.16.1
Bug fixes and other changes
- Fixed deprecation warnings from `kedro.cli` and `kedro.context` when running `kedro jupyter notebook`.
- Fixed a bug where `catalog` and `context` were not available in Jupyter Lab and Notebook.
- Fixed a bug where `kedro build-reqs` would fail if you didn't have your project dependencies installed.
Published by idanov about 6 years ago
kedro - 0.16.0
Major features and improvements
CLI
- Added new CLI commands (only available for the projects created using Kedro 0.16.0 or later):
  - `kedro catalog list` to list datasets in your catalog
  - `kedro pipeline list` to list pipelines
  - `kedro pipeline describe` to describe a specific pipeline
  - `kedro pipeline create` to create a modular pipeline
- Improved the CLI speed by up to 50%.
- Improved error handling when making a typo on the CLI. We now suggest some of the possible commands you meant to type, in `git`-style.
Framework
- All modules in `kedro.cli` and `kedro.context` have been moved into `kedro.framework.cli` and `kedro.framework.context` respectively. `kedro.cli` and `kedro.context` will be removed in future releases.
- Added `Hooks`, which is a new mechanism for extending Kedro.
- Fixed `load_context` changing user's current working directory.
- Allowed the source directory to be configurable in `.kedro.yml`.
- Added the ability to specify nested parameter values inside your node inputs, e.g. `node(func, "params:a.b", None)`
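To illustrate what a dotted input such as `"params:a.b"` refers to, the following is a minimal sketch of resolving a dotted key against a nested parameters dictionary. The helper `resolve_param` and the sample values are hypothetical and only mirror the idea; this is not Kedro's implementation.

```python
from functools import reduce
from typing import Any, Mapping


def resolve_param(parameters: Mapping[str, Any], node_input: str) -> Any:
    """Look up a "params:<dotted.key>" input in a nested dict (illustrative only)."""
    prefix = "params:"
    if not node_input.startswith(prefix):
        raise ValueError(f"Expected an input starting with {prefix!r}, got {node_input!r}")
    keys = node_input[len(prefix):].split(".")
    # Walk one level deeper into the dictionary for every dot-separated key.
    return reduce(lambda level, key: level[key], keys, parameters)


# Hypothetical contents of a parameters file.
parameters = {"a": {"b": 0.2, "c": 10}}
nested_value = resolve_param(parameters, "params:a.b")
```

With this layout, a node that declares `"params:a.b"` as an input would receive only the value stored under `a -> b`, rather than the whole parameters dictionary.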
DataSets
- Added the following new datasets.
| Type | Description | Location |
| -------------------------- | ------------------------------------------- | ------------------------------------------------ |
| pillow.ImageDataSet | Work with image files using Pillow | kedro.extras.datasets.pillow |
| geopandas.GeoJSONDataSet | Work with geospatial data using GeoPandas | kedro.extras.datasets.geopandas.GeoJSONDataSet |
| api.APIDataSet | Work with data from HTTP(S) API requests | kedro.extras.datasets.api.APIDataSet |
- Added `joblib` backend support to `pickle.PickleDataSet`.
- Added versioning support to `MatplotlibWriter` dataset.
- Added the ability to install dependencies for a given dataset with more granularity, e.g. `pip install "kedro[pandas.ParquetDataSet]"`.
- Added the ability to specify extra arguments, e.g. `encoding` or `compression`, for `fsspec.spec.AbstractFileSystem.open()` calls when loading/saving a dataset. See Example 3 under docs.
Other
- Added `namespace` property on `Node`, related to the modular pipeline where the node belongs.
- Added an option to enable asynchronous loading inputs and saving outputs in both `SequentialRunner(is_async=True)` and `ParallelRunner(is_async=True)` class.
- Added `MemoryProfiler` transformer.
- Removed the requirement to have all dependencies for a dataset module to use only a subset of the datasets within.
- Added support for `pandas>=1.0`.
- Enabled Python 3.8 compatibility. Please note that a Spark workflow may be unreliable for this Python version as `pyspark` is not fully-compatible with 3.8 yet.
- Renamed "features" layer to "feature" layer to be consistent with (most) other layers and the relevant FAQ.
Bug fixes and other changes
- Fixed a bug where a new version created mid-run by an external system caused inconsistencies in the load versions used in the current run.
- Documentation improvements
  - Added instruction in the documentation on how to create a custom runner.
  - Updated contribution process in `CONTRIBUTING.md` - added Developer Workflow.
  - Documented installation of development version of Kedro in the FAQ section.
  - Added missing `_exists` method to `MyOwnDataSet` example in `04_user_guide/08_advanced_io`.
- Fixed a bug where `PartitionedDataSet` and `IncrementalDataSet` were not working with `s3a` or `s3n` protocol.
- Added ability to read partitioned parquet file from a directory in `pandas.ParquetDataSet`.
- Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataSet` and `IncrementalDataSet` for per-instance cache invalidation.
- Implemented custom glob function for `SparkDataSet` when running on Databricks.
- Fixed a bug in `SparkDataSet` not allowing for loading data from DBFS in a Windows machine using Databricks-connect.
- Improved the error message for `DataSetNotFoundError` to suggest possible dataset names user meant to type.
- Added the option for contributors to run Kedro tests locally without Spark installation with `make test-no-spark`.
- Added option to lint the project without applying the formatting changes (`kedro lint --check-only`).
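The "did you mean" style of error message mentioned above for `DataSetNotFoundError` can be approximated with the standard library. The sketch below uses `difflib.get_close_matches`; the function name, the cutoff and the message wording are assumptions made for illustration, and Kedro's own matching logic may differ.

```python
from difflib import get_close_matches
from typing import Iterable, List


def suggest_dataset_names(requested: str, available: Iterable[str]) -> List[str]:
    """Return up to three registered names that closely resemble the requested one."""
    return get_close_matches(requested, list(available), n=3, cutoff=0.6)


def not_found_message(requested: str, available: Iterable[str]) -> str:
    """Build an illustrative error message with suggestions, if there are any."""
    message = f"DataSet '{requested}' not found in the catalog"
    matches = suggest_dataset_names(requested, available)
    if matches:
        message += " - did you mean one of these instead: " + ", ".join(matches)
    return message


# Hypothetical catalog entries and a mistyped dataset name.
catalog_names = ["companies", "shuttles", "reviews"]
suggestions = suggest_dataset_names("comapnies", catalog_names)
```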
Breaking changes to the API
Datasets
- Deleted obsolete datasets from `kedro.io`.
- Deleted `kedro.contrib` and `extras` folders.
- Deleted obsolete `CSVBlobDataSet` and `JSONBlobDataSet` dataset types.
- Made `invalidate_cache` method on datasets private.
- `get_last_load_version` and `get_last_save_version` methods are no longer available on `AbstractDataSet`.
- `get_last_load_version` and `get_last_save_version` have been renamed to `resolve_load_version` and `resolve_save_version` on `AbstractVersionedDataSet`, the results of which are cached.
- The `release()` method on datasets extending `AbstractVersionedDataSet` clears the cached load and save version. All custom datasets must call `super()._release()` inside `_release()`.
- `TextDataSet` no longer has `load_args` and `save_args`. These can instead be specified under `open_args_load` or `open_args_save` in `fs_args`.
- `PartitionedDataSet` and `IncrementalDataSet` method `invalidate_cache` was made private: `_invalidate_caches`.
Other
- Removed `KEDRO_ENV_VAR` from `kedro.context` to speed up the CLI run time.
- `Pipeline.name` has been removed in favour of `Pipeline.tag()`.
- Dropped `Pipeline.transform()` in favour of `kedro.pipeline.modular_pipeline.pipeline()` helper function.
- Made constant `PARAMETER_KEYWORDS` private, and moved it from `kedro.pipeline.pipeline` to `kedro.pipeline.modular_pipeline`.
- Layers are no longer part of the dataset object, as they've moved to the `DataCatalog`.
- Python 3.5 is no longer supported by the current and all future versions of Kedro.
Migration guide from Kedro 0.15.* to 0.16.0
Migration for datasets
Since all the datasets (from `kedro.io` and `kedro.contrib.io`) were moved to `kedro/extras/datasets`, you must update the `type` of all datasets in the `<project>/conf/base/catalog.yml` file.
Here is how it should be changed: `type: <SomeDataSet>` -> `type: <subfolder of kedro/extras/datasets>.<SomeDataSet>` (e.g. `type: CSVDataSet` -> `type: pandas.CSVDataSet`).
In addition, all the storage-specific datasets like `CSVLocalDataSet`, `CSVS3DataSet` etc. were deprecated. Instead, you must use generalized datasets like `CSVDataSet`.
E.g. `type: CSVS3DataSet` -> `type: pandas.CSVDataSet`.
Note: No changes are required if you are using your own custom dataset.
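For a catalog with many entries, the renaming described above can be scripted. The following is a minimal sketch that rewrites `type` values in a catalog that has already been loaded into a dictionary. The mapping contains only the examples named in this guide, and both `LEGACY_TO_NEW_TYPE` and `migrate_catalog_types` are hypothetical names, so the mapping would need extending for the dataset types a real project uses.

```python
from typing import Any, Dict

# Partial mapping taken from the examples in this guide (an assumption for
# illustration; add the other dataset types your catalog relies on).
LEGACY_TO_NEW_TYPE: Dict[str, str] = {
    "CSVDataSet": "pandas.CSVDataSet",
    "CSVLocalDataSet": "pandas.CSVDataSet",
    "CSVS3DataSet": "pandas.CSVDataSet",
}


def migrate_catalog_types(catalog: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return a copy of the catalog config with known legacy types rewritten."""
    migrated: Dict[str, Dict[str, Any]] = {}
    for name, config in catalog.items():
        new_config = dict(config)
        old_type = new_config.get("type")
        # Types missing from the mapping (e.g. custom datasets) stay untouched.
        if old_type in LEGACY_TO_NEW_TYPE:
            new_config["type"] = LEGACY_TO_NEW_TYPE[old_type]
        migrated[name] = new_config
    return migrated


# Hypothetical catalog entries as they might look after parsing catalog.yml.
old_catalog = {
    "cars": {"type": "CSVS3DataSet", "filepath": "s3://bucket-name/cars.csv"},
    "custom": {"type": "my_project.io.MyOwnDataSet", "filepath": "data/custom.bin"},
}
new_catalog = migrate_catalog_types(old_catalog)
```

Other arguments of the storage-specific datasets may also need adjusting by hand; the sketch only covers the `type` key.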
Migration for Pipeline.transform()
`Pipeline.transform()` has been dropped in favour of the `pipeline()` constructor. The following changes apply:
- Remember to import `from kedro.pipeline import pipeline`
- The `prefix` argument has been renamed to `namespace`
- And `datasets` has been broken down into more granular arguments:
  - `inputs`: Independent inputs to the pipeline
  - `outputs`: Any output created in the pipeline, whether an intermediary dataset or a leaf output
  - `parameters`: `params:...` or `parameters`
As an example, code that used to look like this with the `Pipeline.transform()` constructor:
```python
result = my_pipeline.transform(
    datasets={"input": "new_input", "output": "new_output", "params:x": "params:y"},
    prefix="pre"
)
```
When used with the new `pipeline()` constructor, becomes:
```python
from kedro.pipeline import pipeline

result = pipeline(
    my_pipeline,
    inputs={"input": "new_input"},
    outputs={"output": "new_output"},
    parameters={"params:x": "params:y"},
    namespace="pre"
)
```
Migration for decorators, color logger, transformers etc.
Since some modules were moved to other locations you need to update import paths appropriately.
You can find the list of moved files in the 0.15.6 release notes under the section titled Files with a new location.
Migration for `KEDRO_ENV_VAR`, the environment variable
Note: If you haven't made significant changes to your `kedro_cli.py`, it may be easier to simply copy the updated `kedro_cli.py` and `.ipython/profile_default/startup/00-kedro-init.py` from GitHub or a newly generated project into your old project.
- We've removed `KEDRO_ENV_VAR` from `kedro.context`. To get your existing project template working, you'll need to remove all instances of `KEDRO_ENV_VAR` from your project template:
  - From the imports in `kedro_cli.py` and `.ipython/profile_default/startup/00-kedro-init.py`: `from kedro.context import KEDRO_ENV_VAR, load_context` -> `from kedro.framework.context import load_context`
  - Remove the `envvar=KEDRO_ENV_VAR` line from the click options in `run`, `jupyter_notebook` and `jupyter_lab` in `kedro_cli.py`
  - Replace `KEDRO_ENV_VAR` with `"KEDRO_ENV"` in `_build_jupyter_env`
  - Replace `context = load_context(path, env=os.getenv(KEDRO_ENV_VAR))` with `context = load_context(path)` in `.ipython/profile_default/startup/00-kedro-init.py`
Migration for `kedro build-reqs`
We have upgraded pip-tools which is used by kedro build-reqs to 5.x. This pip-tools version requires pip>=20.0. To upgrade pip, please refer to their documentation.
Thanks for supporting contributions
@foolsgold, Mani Sarkar, Priyanka Shanbhag, Luis Blanche, Deepyaman Datta, Antony Milne, Panos Psimatikas, Tam-Sanh Nguyen, Tomasz Kaczmarczyk, Kody Fischer, Waylon Walker
Published by idanov about 6 years ago
kedro - 0.15.8
Major features and improvements
- Added the additional libraries to our `requirements.txt` so `pandas.CSVDataSet` class works out of box with `pip install kedro`.
- Added `pandas` to our `extra_requires` in `setup.py`.
- Improved the error message when dependencies of a `DataSet` class are missing.
Published by idanov about 6 years ago
kedro - 0.15.6
Major features and improvements
TL;DR We're launching `kedro.extras`, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in `kedro.extras.datasets` use `fsspec` to access a variety of data stores including local file systems, network file systems, cloud object stores (including S3 and GCP), and Hadoop, read more about this here. The change will allow #178 to happen in the next major release of Kedro.
An example of this new system can be seen below, loading the CSV `SparkDataSet` from S3:
```yaml
weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
```
You can also load data incrementally whenever it is dumped into a directory, using `IncrementalDataSet`, an extension of `PartitionedDataSet` (a feature that allows you to load a directory of files). The `IncrementalDataSet` stores the information about the last processed partition in a checkpoint, read more about this feature here.
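The checkpoint idea can be pictured with a small, library-free sketch: given the identifiers of all partitions in a directory and the last processed one, only the partitions that sort after the checkpoint are selected. The function name and the assumption that partition identifiers compare lexicographically are illustrative; the actual dataset's comparison and storage logic may differ.

```python
from typing import List, Optional


def select_new_partitions(partition_ids: List[str], checkpoint: Optional[str]) -> List[str]:
    """Return the partitions that sort after the last processed one (illustrative only)."""
    ordered = sorted(partition_ids)
    if checkpoint is None:
        # Nothing has been processed yet, so every partition is new.
        return ordered
    return [pid for pid in ordered if pid > checkpoint]


# Hypothetical partition names found in a directory.
partitions = ["2019-12-01.csv", "2019-12-02.csv", "2019-12-03.csv"]
last_processed = "2019-12-01.csv"

new_partitions = select_new_partitions(partitions, last_processed)
# Once the new partitions are processed, the checkpoint advances to the last of them.
next_checkpoint = new_partitions[-1] if new_partitions else last_processed
```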
New features
- Added `layer` attribute for datasets in `kedro.extras.datasets` to specify the name of a layer according to data engineering convention, this feature will be passed to `kedro-viz` in future releases.
- Enabled loading a particular version of a dataset in Jupyter Notebooks and iPython, using `catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>")`.
- Added property `run_id` on `ProjectContext`, used for versioning using the `Journal`. To customise your journal `run_id` you can override the private method `_get_run_id()`.
- Added the ability to install all optional kedro dependencies via `pip install "kedro[all]"`.
- Modified the `DataCatalog`'s load order for datasets, loading order is the following:
  - `kedro.io`
  - `kedro.extras.datasets`
  - Import path, specified in `type`
- Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify (`deepcopy`, `copy` or `assign`) the copy mode to use when loading and saving.
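The three `copy_mode` values listed above correspond to three different sharing semantics. The sketch below shows the difference using only the standard `copy` module; `apply_copy_mode` is a hypothetical helper written for illustration and is not the code used inside `MemoryDataSet` or `CachedDataSet`.

```python
import copy
from typing import Any


def apply_copy_mode(data: Any, copy_mode: str) -> Any:
    """Return data according to the requested copy mode (illustrative only)."""
    if copy_mode == "deepcopy":
        # Fully independent copy, including nested objects.
        return copy.deepcopy(data)
    if copy_mode == "copy":
        # New outer container, but nested objects are still shared.
        return copy.copy(data)
    if copy_mode == "assign":
        # No copy at all: the very same object is handed back.
        return data
    raise ValueError(f"Unknown copy_mode: {copy_mode!r}")


original = {"rows": [1, 2, 3]}
deep = apply_copy_mode(original, "deepcopy")
shallow = apply_copy_mode(original, "copy")
same = apply_copy_mode(original, "assign")
```

As a rule of thumb, `assign` avoids copying cost but lets downstream code mutate the stored object, while `deepcopy` isolates it at the price of extra memory and time.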
New Datasets
| Type | Description | Location |
|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------|
| ParquetDataSet | Handles parquet datasets using Dask | kedro.extras.datasets.dask |
| PickleDataSet | Work with Pickle files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pickle |
| CSVDataSet | Work with CSV files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| TextDataSet | Work with text files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| ExcelDataSet | Work with Excel files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| HDFDataSet | Work with HDF using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| YAMLDataSet | Work with YAML files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.yaml |
| MatplotlibWriter | Save with Matplotlib images using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.matplotlib |
| NetworkXDataSet | Work with NetworkX files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.networkx |
| BioSequenceDataSet | Work with bio-sequence objects using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.biosequence |
| GBQTableDataSet | Work with Google BigQuery | kedro.extras.datasets.pandas |
| FeatherDataSet | Work with feather files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| IncrementalDataSet | Inherit from PartitionedDataSet and remembers the last processed partition | kedro.io |
Files with a new location
| Type | New Location |
|--------------------------------------------------------------------------------------------------|----------------------------------------------|
| JSONDataSet | kedro.extras.datasets.pandas |
| CSVBlobDataSet | kedro.extras.datasets.pandas |
| JSONBlobDataSet | kedro.extras.datasets.pandas |
| SQLTableDataSet | kedro.extras.datasets.pandas |
| SQLQueryDataSet | kedro.extras.datasets.pandas |
| SparkDataSet | kedro.extras.datasets.spark |
| SparkHiveDataSet | kedro.extras.datasets.spark |
| SparkJDBCDataSet | kedro.extras.datasets.spark |
| kedro/contrib/decorators/retry.py | kedro/extras/decorators/retry_node.py |
| kedro/contrib/decorators/memory_profiler.py | kedro/extras/decorators/memory_profiler.py |
| kedro/contrib/io/transformers/transformers.py | kedro/extras/transformers/time_profiler.py |
| kedro/contrib/colors/logging/color_logger.py | kedro/extras/logging/color_logger.py |
| extras/ipython_loader.py | tools/ipython/ipython_loader.py |
| kedro/contrib/io/cached/cached_dataset.py | kedro/io/cached_dataset.py |
| kedro/contrib/io/catalog_with_default/data_catalog_with_default.py | kedro/io/data_catalog_with_default.py |
| kedro/contrib/config/templated_config.py | kedro/config/templated_config.py |
Upcoming deprecations
| Category | Type |
|---------------------------|----------------------------------------------------------------|
| Datasets | BioSequenceLocalDataSet |
| | CSVGCSDataSet |
| | CSVHTTPDataSet |
| | CSVLocalDataSet |
| | CSVS3DataSet |
| | ExcelLocalDataSet |
| | FeatherLocalDataSet |
| | JSONGCSDataSet |
| | JSONLocalDataSet |
| | HDFLocalDataSet |
| | HDFS3DataSet |
| | kedro.contrib.io.cached.CachedDataSet |
| | kedro.contrib.io.catalog_with_default.DataCatalogWithDefault |
| | MatplotlibLocalWriter |
| | MatplotlibS3Writer |
| | NetworkXLocalDataSet |
| | ParquetGCSDataSet |
| | ParquetLocalDataSet |
| | ParquetS3DataSet |
| | PickleLocalDataSet |
| | PickleS3DataSet |
| | TextLocalDataSet |
| | YAMLLocalDataSet |
| Decorators | kedro.contrib.decorators.memory_profiler |
| | kedro.contrib.decorators.retry |
| | kedro.contrib.decorators.pyspark.spark_to_pandas |
| | kedro.contrib.decorators.pyspark.pandas_to_spark |
| Transformers | kedro.contrib.io.transformers.transformers |
| Configuration Loaders | kedro.contrib.config.TemplatedConfigLoader |
Bug fixes and other changes
- Added the option to set/overwrite params in `config.yaml` using YAML dict style instead of string CLI formatting only.
- Kedro CLI arguments `--node` and `--tag` support comma-separated values, alternative methods will be deprecated in future releases.
- Fixed a bug in the `invalidate_cache` method of `ParquetGCSDataSet` and `CSVGCSDataSet`.
- `--load-version` now won't break if version value contains a colon.
- Enabled running `node`s with duplicate inputs.
- Improved error message when empty credentials are passed into `SparkJDBCDataSet`.
- Fixed bug that caused an empty project to fail unexpectedly with ImportError in `template/.../pipeline.py`.
- Fixed bug related to saving dataframe with categorical variables in table mode using `HDFS3DataSet`.
- Fixed bug that caused unexpected behavior when using `from_nodes` and `to_nodes` in pipelines using transcoding.
- Credentials nested in the dataset config are now also resolved correctly.
- Bumped minimum required pandas version to 0.24.0 to make use of `pandas.DataFrame.to_numpy` (recommended alternative to `pandas.DataFrame.values`).
- Docs improvements.
- `Pipeline.transform` skips modifying node inputs/outputs containing `params:` or `parameters` keywords.
- Support for `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
- Datasets can have a new `confirm` function which is called after a successful node function execution if the node contains `confirms` argument with such dataset name.
- Make the resume prompt on pipeline run failure use `--from-nodes` instead of `--from-inputs` to avoid unnecessarily re-running nodes that had already executed.
- When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use `--idle-timeout` option to update it.
- Added `kedro-viz` to the Kedro project template `requirements.txt` file.
- Removed the `results` and `references` folder from the project template.
- Updated contribution process in `CONTRIBUTING.md`.
Breaking changes to the API
- Existing `MatplotlibWriter` dataset in `contrib` was renamed to `MatplotlibLocalWriter`.
- `kedro/contrib/io/matplotlib/matplotlib_writer.py` was renamed to `kedro/contrib/io/matplotlib/matplotlib_local_writer.py`.
- `kedro.contrib.io.bioinformatics.sequence_dataset.py` was renamed to `kedro.contrib.io.bioinformatics.biosequence_local_dataset.py`.
Thanks for supporting contributions
Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez
Published by idanov over 6 years ago
kedro - 0.15.5
Major features and improvements
- New CLI commands and command flags:
  - Load multiple `kedro run` CLI flags from a configuration file with the `--config` flag (e.g. `kedro run --config run_config.yml`)
  - Run parametrised pipeline runs with the `--params` flag (e.g. `kedro run --params param1:value1,param2:value2`).
  - Lint your project code using the `kedro lint` command, your project is linted with `black` (Python 3.6+), `flake8` and `isort`.
- Load specific environments with Jupyter notebooks using `KEDRO_ENV` which will globally set `run`, `jupyter notebook` and `jupyter lab` commands using environment variables.
- Added the following datasets:
  - `CSVGCSDataSet` dataset in `contrib` for working with CSV files in Google Cloud Storage.
  - `ParquetGCSDataSet` dataset in `contrib` for working with Parquet files in Google Cloud Storage.
  - `JSONGCSDataSet` dataset in `contrib` for working with JSON files in Google Cloud Storage.
  - `MatplotlibS3Writer` dataset in `contrib` for saving Matplotlib images to S3.
  - `PartitionedDataSet` for working with datasets split across multiple files.
  - `JSONDataSet` dataset for working with JSON files that uses `fsspec` to communicate with the underlying filesystem. It doesn't support `http(s)` protocol for now.
- Added `s3fs_args` to all S3 datasets.
- Pipelines can be subtracted with `pipeline1 - pipeline2`.
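The `--params` value shown above is a comma-separated list of `key:value` pairs. As a rough sketch of how such a string maps onto a dictionary of parameters, the hypothetical `parse_params` helper below splits it by hand. It keeps every value as a string; whether and how the real CLI converts value types is not covered here.

```python
from typing import Dict


def parse_params(raw: str) -> Dict[str, str]:
    """Split "key1:value1,key2:value2" into a dict of strings (illustrative only)."""
    params: Dict[str, str] = {}
    for item in raw.split(","):
        if not item.strip():
            # Ignore empty segments such as a trailing comma.
            continue
        # Split on the first colon only, so values may themselves contain colons.
        key, separator, value = item.partition(":")
        if not separator or not key.strip():
            raise ValueError(f"Expected 'key:value', got {item!r}")
        params[key.strip()] = value.strip()
    return params


extra_params = parse_params("param1:value1,param2:value2")
```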
Bug fixes and other changes
- `ParallelRunner` now works with `SparkDataSet`.
- Allowed the use of nulls in `parameters.yml`.
- Fixed an issue where `%reload_kedro` wasn't reloading all user modules.
- Fixed `pandas_to_spark` and `spark_to_pandas` decorators to work with functions with kwargs.
- Fixed a bug where `kedro jupyter notebook` and `kedro jupyter lab` would run a different Jupyter installation to the one in the local environment.
- Implemented Databricks-compatible dataset versioning for `SparkDataSet`.
- Fixed a bug where `kedro package` would fail in certain situations where `kedro build-reqs` was used to generate `requirements.txt`.
- Made `bucket_name` argument optional for the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet` - bucket name can now be included into the filepath along with the filesystem protocol (e.g. `s3://bucket-name/path/to/key.csv`).
- Documentation improvements and fixes.
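A filepath of the form `s3://bucket-name/path/to/key.csv` carries the protocol, the bucket and the object key in one string. The sketch below shows one way to pull them apart with the standard library; `split_s3_path` is a hypothetical helper for illustration and is not how the datasets themselves parse the path.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_s3_path(filepath: str) -> Tuple[str, str, str]:
    """Split "protocol://bucket/key" into (protocol, bucket, key)."""
    parts = urlsplit(filepath)
    # The bucket sits where a host name would be; the key is the remaining path.
    return parts.scheme, parts.netloc, parts.path.lstrip("/")


protocol, bucket, key = split_s3_path("s3://bucket-name/path/to/key.csv")
```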
Breaking changes to the API
- Renamed entry point for running pip-installed projects to `run_package()` instead of `main()` in `src/<package>/run.py`.
- `bucket_name` key has been removed from the string representation of the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet`.
- Moved the `mem_profiler` decorator to `contrib` and separated the `contrib` decorators so that dependencies are modular. You may need to update your import paths, for example the pyspark decorators should be imported as `from kedro.contrib.decorators.pyspark import <pyspark_decorator>` instead of `from kedro.contrib.decorators import <pyspark_decorator>`.
Thanks for supporting contributions
Sheldon Tsen, @roumail, Karlson Lee, Waylon Walker, Deepyaman Datta, Giovanni, Zain Patel
Published by idanov over 6 years ago
kedro - 0.15.4
Major features and improvements
- `kedro jupyter` now gives the default kernel a sensible name.
- `Pipeline.name` has been deprecated in favour of `Pipeline.tags`.
- Reuse pipelines within a Kedro project using `Pipeline.transform`, it simplifies dataset and node renaming.
- Added Jupyter Notebook line magic (`%run_viz`) to run `kedro viz` in a Notebook cell (requires `kedro-viz` version 3.0.0 or later).
- Added the following datasets:
  - `NetworkXLocalDataSet` in `kedro.contrib.io.networkx` to load and save local graphs (JSON format) via NetworkX. (by @josephhaaga)
  - `SparkHiveDataSet` in `kedro.contrib.io.pyspark.SparkHiveDataSet` allowing usage of Spark and insert/upsert on non-transactional Hive tables.
- `kedro.contrib.config.TemplatedConfigLoader` now supports name/dict key templating and default values.
Bug fixes and other changes
- `get_last_load_version()` method for versioned datasets now returns exact last load version if the dataset has been loaded at least once and `None` otherwise.
- Fixed a bug in `_exists` method for versioned `SparkDataSet`.
- Enabled the customisation of the ExcelWriter in `ExcelLocalDataSet` by specifying options under `writer` key in `save_args`.
- Fixed a bug in IPython startup script, attempting to load context from the incorrect location.
- Removed capping the length of a dataset's string representation.
- Fixed `kedro install` command failing on Windows if `src/requirements.txt` contains a different version of Kedro.
- Enabled passing a single tag into a node or a pipeline without having to wrap it in a list (i.e. `tags="my_tag"`).
Breaking changes to the API
- Removed `_check_paths_consistency()` method from `AbstractVersionedDataSet`. Version consistency check is now done in `AbstractVersionedDataSet.save()`. Custom versioned datasets should modify `save()` method implementation accordingly.
Thanks for supporting contributions
Joseph Haaga, Deepyaman Datta, Joost Duisters, Zain Patel, Tom Vigrass
Published by nakhan98 over 6 years ago
kedro - 0.15.2
Major features and improvements
- Added `--load-version`, a `kedro run` argument that allows you to run the pipeline with a particular load version of a dataset.
- Support for modular pipelines in `src/`, break the pipeline into isolated parts with reusability in mind.
- Support for multiple pipelines, an ability to have multiple entry point pipelines and choose one with `kedro run --pipeline NAME`.
- Added a `MatplotlibWriter` dataset in `contrib` for saving Matplotlib images.
- An ability to template/parameterize configuration files with `kedro.contrib.config.TemplatedConfigLoader`.
- Parameters are exposed as a context property for ease of access in iPython / Jupyter Notebooks with `context.params`.
- Added `max_workers` parameter for `ParallelRunner`.
Bug fixes and other changes
- Users will override the `_get_pipeline` abstract method in `ProjectContext(KedroContext)` in `run.py` rather than the `pipeline` abstract property. The `pipeline` property is not abstract anymore.
- Improved an error message when versioned local dataset is saved and unversioned path already exists.
- Added `catalog` global variable to `00-kedro-init.py`, allowing you to load datasets with `catalog.load()`.
- Enabled tuples to be returned from a node.
- Disallowed the `ConfigLoader` loading the same file more than once, and deduplicated the `conf_paths` passed in.
- Added a `--open` flag to `kedro build-docs` that opens the documentation on build.
- Updated the `Pipeline` representation to include name of the pipeline, also making it readable as a context property.
- `kedro.contrib.io.pyspark.SparkDataSet` and `kedro.contrib.io.azure.CSVBlobDataSet` now support versioning.
Breaking changes to the API
- `KedroContext.run()` no longer accepts `catalog` and `pipeline` arguments.
- `node.inputs` now returns the node's inputs in the order required to bind them properly to the node's function.
Thanks for supporting contributions
Deepyaman Datta, Luciano Issoe, Joost Duisters, Zain Patel, William Ashford, Karlson Lee
Published by nakhan98 over 6 years ago
kedro - 0.15.1
Major features and improvements
- Extended `versioning` support to cover the tracking of environment setup, code and datasets.
- Added the following datasets:
  - `FeatherLocalDataSet` in `contrib` for usage with pandas. (by @mdomarsaleem)
- Added `get_last_load_version` and `get_last_save_version` to `AbstractVersionedDataSet`.
- Implemented `__call__` method on `Node` to allow for users to execute `my_node(input1=1, input2=2)` as an alternative to `my_node.run(dict(input1=1, input2=2))`.
- Added new `--from-inputs` run argument.
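The `__call__` shortcut mentioned above can be pictured with a small stand-in class. `SketchNode` below is hypothetical and far simpler than Kedro's `Node`; it only demonstrates how a `__call__` method can forward keyword arguments to `run()` so that both calling styles give the same result.

```python
from typing import Any, Callable, Dict, List


class SketchNode:
    """Hypothetical stand-in for a pipeline node; not Kedro's `Node` class."""

    def __init__(self, func: Callable[..., Any], inputs: List[str], output: str) -> None:
        self._func = func
        self._inputs = inputs
        self._output = output

    def run(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Bind the named inputs to the function in the declared order.
        args = [inputs[name] for name in self._inputs]
        return {self._output: self._func(*args)}

    def __call__(self, **kwargs: Any) -> Dict[str, Any]:
        # Keyword arguments are simply forwarded as the inputs dictionary.
        return self.run(kwargs)


def add(input1: int, input2: int) -> int:
    return input1 + input2


my_node = SketchNode(add, ["input1", "input2"], "total")
via_call = my_node(input1=1, input2=2)
via_run = my_node.run(dict(input1=1, input2=2))
```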
Bug fixes and other changes
- Fixed a bug in `load_context()` not loading context in non-Kedro Jupyter Notebooks.
- Fixed a bug in `ConfigLoader.get()` not listing nested files for `**`-ending glob patterns.
- Fixed a logging config error in Jupyter Notebook.
- Updated documentation in `03_configuration` regarding how to modify the configuration path.
- Documented the architecture of Kedro showing how we think about library, project and framework components.
- `extras/kedro_project_loader.py` renamed to `extras/ipython_loader.py` and now runs any IPython startup scripts without relying on the Kedro project structure.
- Fixed TypeError when validating partial function's signature.
- After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are MemoryDataSets.
Breaking changes to the API
None
Thanks for supporting contributions
Omar Saleem, Mariana Silva, Anil Choudhary, Craig
Published by nakhan98 over 6 years ago
kedro - 0.15.0
Major features and improvements
- Added `KedroContext` base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config, runner).
- Added a new CLI command `kedro jupyter convert` to facilitate converting Jupyter Notebook cells into Kedro nodes.
- Added support for `pip-compile` and new Kedro command `kedro build-reqs` that generates `requirements.txt` based on `requirements.in`.
- Running `kedro install` will install packages to conda environment if `src/environment.yml` exists in your project.
- Added a new `--node` flag to `kedro run`, allowing users to run only the nodes with the specified names.
- Added new `--from-nodes` and `--to-nodes` run arguments, allowing users to run a range of nodes from the pipeline.
- Added prefix `params:` to the parameters specified in `parameters.yml` which allows users to differentiate between their different parameter node inputs and outputs.
- Jupyter Lab/Notebook now starts with only one kernel by default.
- Added the following datasets:
  - `CSVHTTPDataSet` to load CSV using HTTP(s) links.
  - `JSONBlobDataSet` to load json (-delimited) files from Azure Blob Storage.
  - `ParquetS3DataSet` in `contrib` for usage with pandas. (by @mmchougule)
  - `CachedDataSet` in `contrib` which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by @tsanikgr)
  - `YAMLLocalDataSet` in `contrib` to load and save local YAML files. (by @Minyus)
Bug fixes and other changes
- Documentation improvements including instructions on how to initialise a Spark session using YAML configuration.
- `anyconfig` default log level changed from `INFO` to `WARNING`.
- Added information on installed plugins to `kedro info`.
- Added style sheets for project documentation, so the output of `kedro build-docs` will resemble the style of `kedro docs`.
Breaking changes to the API
- Simplified the Kedro template in `run.py` with the introduction of `KedroContext` class.
- Merged `FilepathVersionMixIn` and `S3VersionMixIn` under one abstract class `AbstractVersionedDataSet` which extends `AbstractDataSet`.
- `name` changed to be a keyword-only argument for `Pipeline`.
- `CSVLocalDataSet` no longer supports URLs. `CSVHTTPDataSet` supports URLs.
Migration guide from Kedro 0.14.X to Kedro 0.15.0
Migration for Kedro project template
This guide assumes that:
* The framework specific code has not been altered significantly
* Your project specific code is stored in the dedicated python package under src/.
The breaking changes were introduced in the following project template files:
- <project-name>/.ipython/profile_default/startup/00-kedro-init.py
- <project-name>/kedro_cli.py
- <project-name>/src/tests/test_run.py
- <project-name>/src/<package-name>/run.py
- <project-name>/.kedro.yml (new file)
The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using kedro new) and move code and files bit by bit as suggested in the detailed guide below:
1. Create a new project with the same name by running kedro new.
2. Copy the following folders to the new project:
  - results/
  - references/
  - notebooks/
  - logs/
  - data/
  - conf/
3. If you customised your src/<package>/run.py, make sure you apply the same customisations to src/<package>/run.py in the new project:
  - If you customised get_config(), you can override the config_loader property in the ProjectContext derived class.
  - If you customised create_catalog(), you can override the catalog() property in the ProjectContext derived class.
  - If you customised run(), you can override the run() method in the ProjectContext derived class.
  - If you customised the default env, you can override it in the ProjectContext derived class or pass it at construction. By default, env is local.
  - If you customised the default root_conf, you can override the CONF_ROOT attribute in the ProjectContext derived class. By default, the KedroContext base class has the CONF_ROOT attribute set to conf.
4. The following syntax changes are introduced in ipython or Jupyter notebook/labs:
  - proj_dir -> context.project_path
  - proj_name -> context.project_name
  - conf -> context.config_loader
  - io -> context.catalog (e.g., io.load() -> context.catalog.load())
5. If you customised your kedro_cli.py, you need to apply the same customisations to your kedro_cli.py in the new project.
6. Copy the contents of the old project's src/requirements.txt into the new project's src/requirements.in and, from the project root directory, run the kedro build-reqs command in your terminal window.
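The override pattern described for run.py can be sketched as follows. KedroContextSketch is a hypothetical stand-in for the real base class, written only so the example runs on its own. The conf_root_path property, the project path, and the configuration and staging values are assumptions made for illustration.

```python
from pathlib import Path


class KedroContextSketch:
    """Hypothetical stand-in for the KedroContext base class (illustration only)."""

    CONF_ROOT = "conf"  # default value stated in the migration notes

    def __init__(self, project_path, env="local"):
        # `env` defaults to "local", as the notes describe.
        self.project_path = Path(project_path)
        self.env = env

    @property
    def conf_root_path(self):
        # Assumed helper: the directory that configuration is read from.
        return self.project_path / self.CONF_ROOT


class ProjectContext(KedroContextSketch):
    # Plays the role of a customised `root_conf` from a 0.14.X project.
    CONF_ROOT = "configuration"


# `env` can also be passed at construction instead of being overridden.
context = ProjectContext("/tmp/my-project", env="staging")
```

In a real project the derived class would extend the KedroContext class from the new template's run.py rather than this stand-in.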
Migration for versioning custom dataset classes
If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes:
- Make sure your dataset inherits from AbstractVersionedDataSet only.
- Call super().__init__() with the appropriate arguments in the dataset's __init__. If storing on the local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in an exists_function and a glob_function that emulate exists and glob in a different filesystem (see CSVS3DataSet as an example).
- Remove setting of the _filepath and _version attributes in the dataset's __init__, as this is taken care of in the base abstract class.
- Any calls to the _get_load_path and _get_save_path methods should take no arguments.
- Ensure you convert the output of _get_load_path and _get_save_path appropriately, as these now return PurePaths instead of strings.
- Make sure _check_paths_consistency is called with PurePaths as input arguments, instead of strings.
These steps should have brought your project to Kedro 0.15.0. There might be some more minor tweaks needed as every project is unique, but now you have a pretty solid base to work with. If you run into any problems, please consult the Kedro documentation.
Thanks for supporting contributions
Dmitry Vukolov, Jo Stichbury, Angus Williams, Deepyaman Datta, Mayur Chougule, Marat Kopytjuk, Evan Miller, Yusuke Minami
Published by nakhan98 almost 7 years ago
kedro - 0.14.3
Major features and improvements
- Tab completion for catalog datasets in ipython or jupyter sessions. (Thank you @datajoely and @WaylonWalker)
- Added support for transcoding, an ability to decouple loading/saving mechanisms of a dataset from its storage location, denoted by adding '@' to the dataset name.
- Datasets have a new release function that instructs them to free any cached data. The runners will call this when the dataset is no longer needed downstream.
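As a rough illustration of the '@' naming convention used for transcoding, the sketch below splits a dataset name into its shared base name and the part after '@'. The helper split_transcoded_name and the dataset names are hypothetical. This is not Kedro's own implementation.

```python
def split_transcoded_name(dataset_name):
    """Split 'base@format' into (base, format); format is None if '@' is absent."""
    base, separator, suffix = dataset_name.partition("@")
    return base, (suffix if separator else None)


# Two entries that refer to the same underlying data with different
# loading/saving mechanisms share the part before '@'.
spark_entry = split_transcoded_name("my_dataframe@spark")
pandas_entry = split_transcoded_name("my_dataframe@pandas")
plain_entry = split_transcoded_name("my_dataframe")
```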
Bug fixes and other changes
- Add support for pipeline nodes made up from partial functions.
- Expand user home directory ~ for TextLocalDataSet (see issue #19).
- Add a short_name property to Nodes for a display-friendly (but not necessarily unique) name.
- Add Kedro project loader for IPython: extras/kedro_project_loader.py.
- Fix source file encoding issues with Python 3.5 on Windows.
- Fix local project source not having priority over the same source installed as a package, leading to local updates not being recognised.
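For context on the partial-function item, objects created with functools.partial wrap another function and do not carry a __name__ of their own, so code that derives a display name from a function has to look at the wrapped function instead. The sketch below shows one way to do that. The display_name helper is hypothetical and is not taken from Kedro.

```python
from functools import partial


def scale(values, factor):
    """Multiply every value by `factor`."""
    return [value * factor for value in values]


# A partial function with `factor` already bound.
double = partial(scale, factor=2)


def display_name(func):
    # Hypothetical helper: use the wrapped function's name for partials.
    if isinstance(func, partial):
        return func.func.__name__
    return func.__name__
```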
Breaking changes to the API
- Remove the max_loads argument from the MemoryDataSet constructor and from the `AbstractRunner.create_default_data_set` method.
Thanks for supporting contributions
Joel Schwarzmann, Alex Kalmikov
Published by nakhan98 almost 7 years ago
kedro - 0.14.2
Major features and improvements
- Added Data Set transformer support in the form of AbstractTransformer and DataCatalog.add_transformer.
Breaking changes to the API
- Merged the ExistsMixin into AbstractDataSet.
- Pipeline.node_dependencies returns a dictionary keyed by node, with sets of parent nodes as values; Pipeline and ParallelRunner were refactored to make use of this for topological sort for node dependency resolution and running pipelines respectively.
- Pipeline.grouped_nodes returns a list of sets, rather than a list of lists.
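To show how a dictionary of parent sets can drive a topological sort that yields a list of sets, here is a minimal sketch. It is a generic layered sort, not Kedro's implementation. The group_nodes function and the node names are hypothetical, with plain strings standing in for node objects.

```python
def group_nodes(node_dependencies):
    """Group nodes into sets that can run once all earlier groups are done.

    `node_dependencies` maps each node to the set of its parent nodes.
    """
    # Copy the parent sets so the caller's dictionary is not modified.
    remaining = {node: set(parents) for node, parents in node_dependencies.items()}
    groups = []
    while remaining:
        # Nodes with no unresolved parents are ready to run.
        ready = {node for node, parents in remaining.items() if not parents}
        if not ready:
            raise ValueError("Circular dependency detected")
        groups.append(ready)
        for node in ready:
            del remaining[node]
        # Mark the finished nodes as resolved for everything still waiting.
        for parents in remaining.values():
            parents -= ready
    return groups


dependencies = {
    "clean": set(),
    "features": {"clean"},
    "labels": {"clean"},
    "train": {"features", "labels"},
}
groups = group_nodes(dependencies)
```

Each group is a set because the nodes within it do not depend on one another, so their relative order carries no meaning.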
Thanks for supporting contributions
Published by nakhan98 almost 7 years ago
kedro - 0.14.0
Major features and improvements
The initial release of Kedro.
Thanks for supporting contributions
Jo Stichbury, Aris Valtazanos, Fabian Peters, Guilherme Braccialli, Joel Schwarzmann, Miguel Beltre, Mohammed ElNabawy, Deepyaman Datta, Shubham Agrawal, Oleg Andreyev, Mayur Chougule, William Ashford, Ed Cannon, Nikhilesh Nukala, Sean Bailey, Vikram Tegginamath, Thomas Huijskens, Musa Bilal.
We are also grateful to everyone who advised and supported us, filed issues or helped resolve them, asked and answered questions and were part of inspiring discussions.
Published by nakhan98 almost 7 years ago