Science Score: 62.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ✓ Institutional organization owner: organization ssec-jhu has institutional domain (ai.jhu.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.7%) to scientific vocabulary
Repository
Distributed (Data, Discovery) Pipeline Utilities
Basic Info
Statistics
- Stars: 2
- Watchers: 4
- Forks: 0
- Open Issues: 14
- Releases: 19
Metadata Files
README.md
SSEC-JHU dplutils

About
This package provides python utilities to define and execute graphs of tasks that operate on and produce dataframes in a batched-streaming manner. The primary aims are as follows:
- Operate on an indefinite stream of batches of input data.
- Execute tasks in a distributed fashion using configurable execution backends (e.g. Ray).
- Schedule resources on a per-task basis.
- Configurable batching: enable co-location for high-overhead tasks or maximal spread for resource intensive tasks.
- Provide a simple interface to create tasks and insert them into a pipeline.
- Provide validations to help ensure tasks are correctly configured prior to (potentially long-running) execution.
Discovery pipelines that generate input samples on-the-fly are particularly well suited for this package, though input can be taken from any source, including on-disk tables.
This package also includes utilities for observing pipelines using metrics tools such as mlflow or aim, and provides functionality for making simple CLI interfaces from pipeline definitions.
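The dataframe-in/dataframe-out, batched-streaming contract described above can be illustrated with plain pandas, independent of any executor. The names below (`batch_stream`, `add_sample`) are illustrative only and are not part of the dplutils API:

```python
import numpy as np
import pandas as pd

def batch_stream(batch_size=4, n_batches=3):
    # Stand-in for an indefinite input stream: yields successive dataframe
    # batches (a real discovery pipeline could keep yielding forever).
    for i in range(n_batches):
        yield pd.DataFrame({'id': np.arange(i * batch_size, (i + 1) * batch_size)})

def add_sample(df):
    # A task in the dataframe-in/dataframe-out contract: takes a batch,
    # returns a batch with a column added.
    return df.assign(sample=np.random.random(len(df)))

for batch in batch_stream():
    out = add_sample(batch)
    print(out.shape)  # each batch flows through the task independently
```

Each batch is processed independently, which is what allows an executor to schedule batches across workers.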
Quick Start
Getting up and running is easy: simply install the package:
```sh
pip install dplutils
```
Then define the tasks, connect them into a pipeline, and then iterate over the output dataframes:
```py
import numpy as np
from dplutils.pipeline import PipelineGraph, PipelineTask

# Definitions of task code - note all take a dataframe as first argument
# and return a dataframe
def generate_sample(df):
    return df.assign(sample=np.random.random(len(df)))

def round_sample(df, decimals=1):
    return df.assign(rounded=df['sample'].round(decimals))

def calc_residual(df):
    return df.assign(residual=df['rounded'] - df['sample'])

# Connect them together in an execution graph (along with execution metadata)
pipeline = PipelineGraph([
    PipelineTask('generate', generate_sample),
    PipelineTask('round', round_sample, batch_size=10),
    PipelineTask('resid', calc_residual, num_cpus=2),
])

# Run the tasks and iterate over the outputs, here using the Ray execution
# framework
from dplutils.pipeline.ray import RayStreamGraphExecutor

executor = RayStreamGraphExecutor(pipeline).set_config('round.kwargs.decimals', 2)
for result_batch in executor.run():
    print(result_batch)
    break  # otherwise it will - by design - run indefinitely!
```
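For intuition, the three tasks above compose like ordinary dataframe functions. The sketch below applies them to a single seed batch with plain pandas and no executor, so it omits batching and distribution entirely:

```python
import numpy as np
import pandas as pd

def generate_sample(df):
    return df.assign(sample=np.random.random(len(df)))

def round_sample(df, decimals=1):
    return df.assign(rounded=df['sample'].round(decimals))

def calc_residual(df):
    return df.assign(residual=df['rounded'] - df['sample'])

# A seed batch of 5 empty rows; each task adds a column.
seed = pd.DataFrame(index=range(5))
result = calc_residual(round_sample(generate_sample(seed), decimals=2))
print(result.columns.tolist())  # ['sample', 'rounded', 'residual']
```

With `decimals=2`, the residual is bounded by half of the last retained digit, i.e. at most 0.005 in absolute value.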
As an alternative to iterating over batches directly as above, we can use the CLI utilities to run the given executor as a tool. The helper takes care of argument handling, so we only need to define the desired executor:
```py
from dplutils.cli import cli_run

executor = RayStreamGraphExecutor(pipeline)

if __name__ == '__main__':
    cli_run(executor)
```
then run our module with parameters as needed. The CLI-based run will write the output to a parquet table at a specified location (below assumes the code is in ourmodule.py):

```sh
python -m ourmodule -o /path/to/outdir --set-config round.batch_size=5
```
The above is of course a trivial example to demonstrate the structure and simplicity of defining pipelines and how they operate on configurable-sized batches represented as dataframes. For more information on defining tasks, their inputs, seeding the input tasks, more complex graph structures and distributed execution, among other topics, see the documentation at: https://dplutils.readthedocs.io/en/stable/.
Scaling out
One of the goals of this project is to simplify the scaling out of connected tasks on a variety of systems. PipelineExecutor implementations are responsible for this: the package provides a framework for adding executors based on appropriate underlying scheduling/execution systems and includes some implementations, for example RayStreamGraphExecutor using Ray. The setup required to properly configure an executor for scaling depends on the backend used; for example, Ray relies on having a cluster previously bootstrapped (see https://docs.ray.io/en/latest/cluster/getting-started.html), though it can operate locally without any prior setup.
Resource specifications
Another primary goal is to arrange for resource dependencies to be met by the execution environment for a particular task. Resources such as number of CPUs or GPUs are natural targets and supported by many systems. We also want to support arbitrary resource requests, for example if a task requires a large local database on fast disks, this might be available only on one node in a cluster. A custom resource can be used to ensure that batches of a particular task always execute on the environment that has that resource.
This ability depends on the executor backend supporting it, so executor implementations typically only make sense for systems that do - of which Ray is one.
For instance, if a task required a database as described above, the task might be defined in the following manner:
```py
PipelineTask('big_db_operation', function_needs_big_db, resources={'bigdb': 1})
```
And if using the Ray executor, then at least one worker in the cluster that has local fast disks with the big database resident would be started similar to:
```sh
ray start --address {head-ip} --resources '{"bigdb": 1}'
```
In other execution systems, the worker might be started in a different manner, but the task definition could remain as-is, enabling easy swapping of the execution environment depending on the situation.
Building and Development
If you need to make modifications to the source code, follow the steps below to get the source and run tests. The process is simple: we use Tox to manage test and build environments. We welcome contributions; please open a pull request!
Setup
Clone this repository locally:
```sh
git clone https://github.com/ssec-jhu/dplutils.git
cd dplutils
```
Install dependencies:
```sh
pip install -r requirements/dev.txt
```
Tests
Run tox:
- `tox -e test`, to run just the tests
- `tox`, to run linting, tests and build. This should run without errors prior to commit.
Docker
From the repo directory, run
```sh
docker build -f docker/Dockerfile --tag dplutils .
```
Owner
- Name: Scientific Software Engineering Center at JHU
- Login: ssec-jhu
- Kind: organization
- Email: ssec@jhu.edu
- Location: United States of America
- Website: https://ai.jhu.edu/ssec/
- Repositories: 1
- Profile: https://github.com/ssec-jhu
Accelerating Software Development for Science Research
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software in your work, please cite it using the following metadata."
authors:
  - family-names: "Hunter"
    given-names: "Edward"
    orcid: "https://orcid.org/0000-0002-6876-001X"
  - family-names: "Noss"
    given-names: "James"
    orcid: "https://orcid.org/0000-0002-0922-5770"
  - family-names: "Kluzner"
    given-names: "Vladimir"
    orcid: "https://orcid.org/0009-0000-5844-661X"
  - family-names: "Lemson"
    given-names: "Gerard"
    orcid: "https://orcid.org/0000-0001-5041-2458"
  - family-names: "Mitschang"
    given-names: "Arik"
    orcid: "https://orcid.org/0000-0001-9239-012X"
title: "dplutils"
version: 0.0.1
doi: <insert zenodo DOI>
date-released: 2023-01-01
url: "https://github.com/ssec-jhu/dplutils"
```
GitHub Events
Total
- Create event: 27
- Issues event: 3
- Release event: 2
- Watch event: 1
- Delete event: 25
- Issue comment event: 29
- Push event: 44
- Pull request review comment event: 2
- Pull request review event: 23
- Pull request event: 49
Last Year
- Create event: 27
- Issues event: 3
- Release event: 2
- Watch event: 1
- Delete event: 25
- Issue comment event: 29
- Push event: 44
- Pull request review comment event: 2
- Pull request review event: 23
- Pull request event: 49
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 24
- Total pull requests: 79
- Average time to close issues: about 1 month
- Average time to close pull requests: 6 days
- Total issue authors: 3
- Total pull request authors: 4
- Average comments per issue: 0.08
- Average comments per pull request: 1.16
- Merged pull requests: 55
- Bot issues: 1
- Bot pull requests: 52
Past Year
- Issues: 4
- Pull requests: 28
- Average time to close issues: 5 days
- Average time to close pull requests: 5 days
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.5
- Average comments per pull request: 0.57
- Merged pull requests: 16
- Bot issues: 0
- Bot pull requests: 22
Top Authors
Issue Authors
- amitschang (22)
- jamienoss (1)
- dependabot[bot] (1)
Pull Request Authors
- dependabot[bot] (52)
- amitschang (25)
- jamienoss (1)
- ryanhausen (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 33 last month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 16
- Total maintainers: 1
pypi.org: dplutils
- Documentation: https://dplutils.readthedocs.io/
- License: BSD 3-Clause License

  Copyright (c) 2023, Scientific Software Engineering Center at JHU

  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
Latest release: 0.8.1
published 12 months ago
Rankings
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- docker/build-push-action f2a1d5e99d037542a71f64918e516c093c6f3fc4 composite
- docker/login-action 65b78e6e13532edd9afa3aa52ac7964289d1a9c1 composite
- docker/metadata-action 9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- python 3.10 build
- numpy *
- pandas *
- pyarrow *
- ray [default]
- bandit ==1.7.5
- build ==0.10.0
- pytest ==7.4.2
- pytest-cov ==4.1.0
- ruff ==0.1.2
- tox ==4.11.3 development
- nbsphinx ==0.9.3
- sphinx ==6.2.1
- sphinx-automodapi ==0.15.0
- sphinx-issues ==3.0.1
- sphinx_book_theme ==1.0.1
- sphinx_rtd_theme ==1.3.0
- numpy ==1.26.1
- pandas ==2.1.1
- pyarrow ==13.0.0
- ray ==2.7.1
- actions/checkout v4 composite
- actions/download-artifact v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- pypa/gh-action-pypi-publish release/v1 composite
- aim ==3.17.5 test
- mlflow ==2.8.1 test