dane-workflows
Python library for working with DANE environments
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.4%) to scientific vocabulary
Repository
Python library for working with DANE environments
Basic Info
- Host: GitHub
- Owner: beeldengeluid
- License: mit
- Language: Python
- Default Branch: main
- Size: 551 KB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 4
- Releases: 0
Metadata Files
README.md
Introduction
Python library for creating "processing workflows" that use DANE environments. Depending on its setup, each DANE environment offers an API for some kind of multimedia processing, e.g.:
- Automatic Speech Recognition
- Named Entity Extraction
- Computer Vision algorithms
- Any kind of Machine Learning algorithm
This Python library is not limited to DANE, however: it can also be used to hook up any API that generates output data from given input data.
Architecture
The following image illustrates the dane-workflows architecture:

The following section details more about concepts illustrated in the image.
Definition of a workflow
A workflow is able to iteratively:
- obtain input/source data from a DataProvider
- send it to a ProcessingEnvironment (e.g. DANE environment)
- wait for the processing environment to complete its work
- obtain results from the processing environment
- pass results to an Exporter, which typically reconciles the processed data with the source data
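The five steps above can be sketched as a simple loop. This is a minimal, self-contained illustration and not the library's actual code: `ListDataProvider`, `UppercaseEnvironment`, `CollectingExporter` and `run_workflow` are hypothetical stand-ins, and the real `TaskScheduler` API may look different.

```python
from typing import Dict, Iterator, List


class ListDataProvider:
    """Hypothetical stand-in for a DataProvider: yields source items in batches."""

    def __init__(self, items: List[str], batch_size: int) -> None:
        self.items = items
        self.batch_size = batch_size

    def batches(self) -> Iterator[List[str]]:
        for start in range(0, len(self.items), self.batch_size):
            yield self.items[start : start + self.batch_size]


class UppercaseEnvironment:
    """Hypothetical stand-in for a processing environment (works synchronously)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def submit(self, batch: List[str]) -> None:
        # A real environment would queue the work remotely; here it is done at once.
        for item in batch:
            self._results[item] = item.upper()

    def is_done(self, batch: List[str]) -> bool:
        return all(item in self._results for item in batch)

    def fetch_results(self, batch: List[str]) -> Dict[str, str]:
        return {item: self._results[item] for item in batch}


class CollectingExporter:
    """Hypothetical stand-in for an Exporter: keeps each result keyed by its source item."""

    def __init__(self) -> None:
        self.exported: Dict[str, str] = {}

    def export(self, results: Dict[str, str]) -> None:
        self.exported.update(results)


def run_workflow(provider, environment, exporter) -> int:
    """Run the five workflow steps for every batch; return the number of batches."""
    processed = 0
    for batch in provider.batches():  # 1. obtain source data
        environment.submit(batch)  # 2. send it to the processing environment
        while not environment.is_done(batch):  # 3. wait for completion
            pass  # a real scheduler would sleep between polls
        results = environment.fetch_results(batch)  # 4. obtain results
        exporter.export(results)  # 5. pass results to the exporter
        processed += 1
    return processed


exporter = CollectingExporter()
batch_count = run_workflow(
    ListDataProvider(["a", "b", "c"], batch_size=2), UppercaseEnvironment(), exporter
)
print(batch_count, exporter.exported)
```

Keeping the three roles behind small interfaces is what lets each one be swapped independently, which matches the subclassing approach described for the components.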
As mentioned in the definition of a workflow, this Python library works with the following components/concepts:
TaskScheduler
Main process that handles all the steps described in the Definition of a workflow
StatusHandler
Keeps track of the workflow status, ensuring recovery after crashes. By default the status is persisted to a SQLite database file using the SQLiteStatusHandler, but other implementations can be made by subclassing StatusHandler.
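To illustrate why persisting status to a file allows recovery, here is a minimal sketch using Python's built-in `sqlite3` module. The class name `TinySQLiteStatus`, the table layout and the state names are assumptions made for this example; the schema used by the real SQLiteStatusHandler is not documented here and may differ.

```python
import os
import sqlite3
import tempfile
from typing import List


class TinySQLiteStatus:
    """Hypothetical status store; not the library's SQLiteStatusHandler."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS status ("
            "item_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def set_state(self, item_id: str, state: str) -> None:
        # INSERT OR REPLACE overwrites the previous state of the same item.
        self.conn.execute(
            "INSERT OR REPLACE INTO status (item_id, state) VALUES (?, ?)",
            (item_id, state),
        )
        self.conn.commit()

    def unfinished(self) -> List[str]:
        rows = self.conn.execute(
            "SELECT item_id FROM status WHERE state != 'done' ORDER BY item_id"
        ).fetchall()
        return [row[0] for row in rows]


db_path = os.path.join(tempfile.mkdtemp(), "status.db")

first_run = TinySQLiteStatus(db_path)
first_run.set_state("doc-1", "done")
first_run.set_state("doc-2", "processing")
first_run.conn.close()  # simulate the process stopping mid-workflow

# A new process opens the same file and can see what still needs work.
recovered = TinySQLiteStatus(db_path)
to_resume = recovered.unfinished()
recovered.conn.close()
print(to_resume)
```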
StatusMonitor
Note: This component is currently being implemented and is not yet available.
Runs on top of the StatusHandler database and visualises the overall progress of a workflow in a human-readable manner (e.g. show the % of successfully/failed processed items)
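Since this component is not yet available, the following is only a sketch of the kind of summary it describes: turning a list of per-item states into percentages. The function name `summarise_progress` and the state names `"success"` and `"failed"` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarise_progress(states: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of items per state, rounded to one decimal."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing processed yet, so there is nothing to report
    return {state: round(100 * n / total, 1) for state, n in counts.items()}


summary = summarise_progress(["success", "success", "failed", "success"])
print(summary)
```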
DataProvider
Iteratively called by the TaskScheduler to obtain a new batch of source data. No default implementations are available (yet), since there are many possible ways one would want to supply data to a system. Simply subclass from DataProvider to have full control over your input flow.
DataProcessingEnvironment
Iteratively called by the TaskScheduler to submit batches of data to an (external) processing environment. Also takes care of obtaining the output of finished processes from such an environment.
This library contains a full implementation, DANEEnvironment, for interacting with DANE environments, but other environments/APIs can be supported by subclassing DataProcessingEnvironment.
Exporter
Called by the TaskScheduler with output data from a processing environment. No default implementation is available (yet), since this is typically the most use-case sensitive part of any workflow, meaning you should decide what to do with the output data (by subclassing Exporter).
Getting started
Prerequisites
- Python >= 3.8 <= 3.10
- Poetry
Installation
Install the latest version from pypi.org, e.g. with:
pip install dane-workflows
Local development
Run poetry install. After completion run:
poetry shell
To check that the contents of this repository work well, run:
./scripts/check-project.sh
TODO finalise
Usage
After installing dane-workflows in your local environment, you can run an example workflow with:
python main.py
This example script uses config-example.yml to configure and run a workflow using the following implementations:
- DataProvider: ExampleDataProvider (with two dummy input documents)
- DataProcessingEnvironment: ExampleDataProcessingEnvironment (mocks processing environment)
- StatusHandler: SQLiteStatusHandler (writes output to ./proc_stats/all_stats.db)
- Exporter: ExampleExporter (does nothing with results)
To set up a workflow for your own purposes, consider the following:
What data do I want to process?
We've provided the ExampleDataProvider to easily feed a workflow with a couple of files (via config.yml). This is mostly for testing out your workflow.
Most likely you'll need to implement your own DataProvider by subclassing it. This way you can e.g. load your input data from a database, spreadsheet or whatever else you need.
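As a rough sketch of what such a subclass could look like, the example below reads source items from CSV data in batches. `BaseProvider` and its `next_batch` method are hypothetical stand-ins defined here for illustration; check the library's DataProvider class for the methods it actually requires.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import List, Optional, TextIO


class BaseProvider(ABC):
    """Hypothetical stand-in for the library's DataProvider base class."""

    @abstractmethod
    def next_batch(self) -> Optional[List[dict]]:
        """Return the next batch of source items, or None when exhausted."""


class CSVProvider(BaseProvider):
    """Reads rows from an open CSV file (or any text stream) in fixed-size batches."""

    def __init__(self, handle: TextIO, batch_size: int) -> None:
        self._reader = csv.DictReader(handle)
        self._batch_size = batch_size

    def next_batch(self) -> Optional[List[dict]]:
        batch = []
        for row in self._reader:  # continues where the previous call stopped
            batch.append(dict(row))
            if len(batch) == self._batch_size:
                break
        return batch or None


# An in-memory CSV keeps the example self-contained; a real file handle works the same.
source = io.StringIO(
    "id,url\n"
    "1,http://example.org/a.mp4\n"
    "2,http://example.org/b.mp4\n"
    "3,http://example.org/c.mp4\n"
)
provider = CSVProvider(source, batch_size=2)
first = provider.next_batch()
second = provider.next_batch()
third = provider.next_batch()
print(first, second, third)
```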
Which processing environment will I use?
Since this project was developed to interface with running DANE environments at the very least, we've provided DANEEnvironment as a default implementation of DataProcessingEnvironment.
In case you'd like to call any other tool for processing your data, you're required to implement a subclass of DataProcessingEnvironment.
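The sketch below shows one way such a subclass might wrap an arbitrary tool: batches are submitted, polled for completion, and their output fetched. `BaseEnvironment` and its three methods are assumptions made for this example and do not reflect the actual DataProcessingEnvironment interface.

```python
from abc import ABC, abstractmethod
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class BaseEnvironment(ABC):
    """Hypothetical stand-in for DataProcessingEnvironment."""

    @abstractmethod
    def submit(self, batch_id: int, items: List[str]) -> None: ...

    @abstractmethod
    def is_finished(self, batch_id: int) -> bool: ...

    @abstractmethod
    def fetch_output(self, batch_id: int) -> List[str]: ...


class LocalToolEnvironment(BaseEnvironment):
    """Runs any Python callable on each item of a batch in a background thread."""

    def __init__(self, tool: Callable[[str], str]) -> None:
        self._tool = tool
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._jobs: Dict[int, Future] = {}

    def submit(self, batch_id: int, items: List[str]) -> None:
        self._jobs[batch_id] = self._pool.submit(
            lambda: [self._tool(item) for item in items]
        )

    def is_finished(self, batch_id: int) -> bool:
        return self._jobs[batch_id].done()

    def fetch_output(self, batch_id: int) -> List[str]:
        return self._jobs[batch_id].result()  # blocks until the batch is ready

    def close(self) -> None:
        self._pool.shutdown()


# The "tool" here just extracts a file extension; any callable could take its place.
env = LocalToolEnvironment(tool=lambda name: name.rsplit(".", 1)[-1])
env.submit(1, ["speech.wav", "news.mp4"])
output = env.fetch_output(1)
finished = env.is_finished(1)
env.close()
print(output, finished)
```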
What will I do with the output of the processing environment?
After your DataProcessingEnvironment has processed a batch of items from your DataProvider, the TaskScheduler hands over the output data to your subclass of Exporter.
Since this is the most use-case dependent part of any workflow, we do not provide any useful default implementation.
Note: ExampleExporter is only used as a placeholder for tests or dry runs.
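For illustration, an exporter that reconciles results with their source items and writes them as JSON lines could look roughly like this. `JSONLinesExporter`, its `export` signature and the `id` field are hypothetical; the library's Exporter base class may expect different methods and data shapes.

```python
import io
import json
from typing import Dict, List, TextIO


class JSONLinesExporter:
    """Hypothetical exporter; the library's Exporter base class may look different."""

    def __init__(self, handle: TextIO) -> None:
        self._handle = handle

    def export(self, source_items: List[Dict], results: Dict[str, Dict]) -> int:
        """Write one JSON line per source item that has a result; return the count."""
        written = 0
        for item in source_items:
            result = results.get(item["id"])
            if result is None:
                continue  # no output for this item (e.g. processing failed)
            merged = {**item, "result": result}  # source data plus processed data
            self._handle.write(json.dumps(merged, sort_keys=True) + "\n")
            written += 1
        return written


buffer = io.StringIO()  # stands in for an open output file
exporter = JSONLinesExporter(buffer)
count = exporter.export(
    [{"id": "doc-1", "title": "News"}, {"id": "doc-2", "title": "Sports"}],
    {"doc-1": {"transcript": "hello"}},
)
lines = buffer.getvalue().splitlines()
print(count, lines)
```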
Roadmap
- [x] Implement more advanced recovery
- [x] Add example workflows (refer in README)
- [x] Finalise initial README
- [ ] Add Python docstring
See the open issues for a full list of proposed features, known issues and user questions.
License
Distributed under the MIT License. See LICENSE.txt for more information.
Contact
Use the issue tracker for any questions concerning this repository
Project Link: https://github.com/beeldengeluid/dane-workflows
Codemeta.json requirements: https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md
Owner
- Name: Beeld & Geluid
- Login: beeldengeluid
- Kind: organization
- Location: Netherlands
- Website: https://beeldengeluid.nl
- Twitter: beeldengeluid
- Repositories: 37
- Profile: https://github.com/beeldengeluid
The Netherlands Institute for Sound and Vision (NISV)
CodeMeta (codemeta.json)
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"codeRepository": "https://github.com/beeldengeluid/dane-workflows",
"dateCreated": "2022-07-18",
"issueTracker": "https://github.com/beeldengeluid/dane-workflows/issues",
"name": "dane-workflows",
"version": "0.9.0",
"description": "Python library for setting up simple data processing workflows (using DANE)",
"applicationCategory": "Multimedia processing",
"developmentStatus": "wip",
"funder": {
"@type": "Organization",
"name": "CLARIAH",
"url": "https://www.clariah.nl"
},
"programmingLanguage": [
"Python 3"
],
"softwareRequirements": [
"Python 3.10"
],
"author": [
{
"@type": "Person",
"@id": "https://github.com/jblom",
"givenName": "Jaap",
"familyName": "Blom",
"affiliation": {
"@type": "Organization",
"name": "The Netherlands Institute for Sound and Vision"
}
},
{
"@type": "Person",
"@id": "https://github.com/mwigham",
"affiliation": {
"@type": "Organization",
"name": "The Netherlands Institute for Sound and Vision"
}
}
],
"contributor": []
}
GitHub Events
Total
- Member event: 1
Last Year
- Member event: 1
Committers
Last synced: about 3 years ago
All Time
- Total Commits: 149
- Total Committers: 5
- Avg Commits per committer: 29.8
- Development Distribution Score (DDS): 0.383
Top Committers
| Name | Email | Commits |
|---|---|---|
| jblom | j****m@b****l | 92 |
| KleinRana | r****n@b****l | 32 |
| mwigham | 3****m@u****m | 13 |
| Philo van Kemenade | p****e@b****l | 11 |
| Rana | 4****a@u****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 10
- Total pull requests: 18
- Average time to close issues: 6 months
- Average time to close pull requests: 6 days
- Total issue authors: 5
- Total pull request authors: 5
- Average comments per issue: 3.0
- Average comments per pull request: 0.56
- Merged pull requests: 16
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- phivk (3)
- jblom (3)
- mwigham (2)
- Veldhoen (1)
- jwassenaarbg (1)
Pull Request Authors
- jblom (13)
- KleinRana (3)
- phivk (2)
- mwigham (1)
- dependabot[bot] (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: pypi 16 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 4
- Total maintainers: 1
pypi.org: dane-workflows
Library providing batch upload & monitoring for (DANE) processing environments
- Homepage: https://github.com/beeldengeluid/dane-workflows
- Documentation: https://dane-workflows.readthedocs.io/
- License: Apache-2.0
- Latest release: 0.2.3 (published over 3 years ago)
Rankings
Maintainers (1)
Dependencies
- atomicwrites 1.4.1 develop
- attrs 21.4.0 develop
- black 22.6.0 develop
- click 8.1.3 develop
- colorama 0.4.5 develop
- coverage 6.4.2 develop
- flake8 4.0.1 develop
- iniconfig 1.1.1 develop
- mccabe 0.6.1 develop
- mockito 1.3.3 develop
- mypy 0.961 develop
- mypy-extensions 0.4.3 develop
- packaging 21.3 develop
- pathspec 0.9.0 develop
- platformdirs 2.5.2 develop
- pluggy 1.0.0 develop
- py 1.11.0 develop
- pycodestyle 2.8.0 develop
- pyflakes 2.4.0 develop
- pyparsing 3.0.9 develop
- pytest 7.1.2 develop
- pytest-cov 3.0.0 develop
- tomli 2.0.1 develop
- types-requests 2.28.1 develop
- types-urllib3 1.26.16 develop
- typing-extensions 4.3.0 develop
- certifi 2022.6.15
- charset-normalizer 2.1.0
- dane 0.3.2
- elasticsearch7 7.17.4
- idna 3.3
- pika 1.3.0
- pyyaml 6.0
- requests 2.28.1
- types-pyyaml 6.0.10
- urllib3 1.26.10
- yacs 0.1.8
- black ^22.3.0 develop
- flake8 ^4.0.1 develop
- mockito ^1.3.1 develop
- mypy ^0.961 develop
- pytest ^7.1.2 develop
- pytest-cov ^3.0.0 develop
- types-requests ^2.27.31 develop
- dane ^0.3.2
- pika ^1.2.1
- python ^3.10
- requests ^2.28.0
- types-PyYAML ^6.0.10
- yacs ^0.1.8
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- snok/install-poetry v1 composite