dane-workflows

Python library for working with DANE environments

https://github.com/beeldengeluid/dane-workflows

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: beeldengeluid
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 551 KB
Statistics
  • Stars: 0
  • Watchers: 3
  • Forks: 0
  • Open Issues: 4
  • Releases: 0
Created almost 4 years ago · Last pushed almost 2 years ago
Metadata Files
  • Readme
  • License
  • Codemeta

README.md

Introduction

Python library for creating "processing workflows" that use DANE environments, which in a nutshell offer, depending on the setup of each environment, an API for some kind of multi-media processing, e.g.:

  • Automatic Speech Recognition
  • Named Entity Extraction
  • Computer Vision algorithms
  • Any kind of Machine Learning algorithm

This Python library is, however, not limited to DANE: it can also be used to hook up any API that generates data from given input data.

Architecture

The following image illustrates the dane-workflows architecture:

[Image: dane-workflows architecture diagram]

The following sections describe the concepts illustrated in the image in more detail.

Definition of a workflow

A workflow is able to iteratively:

  • obtain input/source data from a DataProvider
  • send it to a DataProcessingEnvironment (e.g. a DANE environment)
  • wait for the processing environment to complete its work
  • obtain results from the processing environment
  • pass results to an Exporter, which typically reconciles the processed data with the source data
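
The steps above can be sketched as a simple loop. This is a minimal illustration only: the class names (UppercaseEnvironment, CollectingExporter), the method names (submit, is_done, fetch_results, export) and the run_workflow function are hypothetical and are not taken from the dane-workflows API.

```python
import time
from typing import Dict, Iterable, List


class UppercaseEnvironment:
    """Hypothetical processing environment that finishes immediately."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def submit(self, batch: List[str]) -> None:
        self._results = {item: item.upper() for item in batch}

    def is_done(self) -> bool:
        return True

    def fetch_results(self) -> Dict[str, str]:
        return dict(self._results)


class CollectingExporter:
    """Hypothetical exporter that simply remembers everything it receives."""

    def __init__(self) -> None:
        self.exported: Dict[str, str] = {}

    def export(self, results: Dict[str, str]) -> None:
        self.exported.update(results)


def run_workflow(
    batches: Iterable[List[str]],
    environment: UppercaseEnvironment,
    exporter: CollectingExporter,
    poll_interval: float = 0.0,
) -> int:
    """Run the five steps for every batch; return the number of batches handled."""
    handled = 0
    for batch in batches:                      # 1. obtain source data
        environment.submit(batch)              # 2. send it off for processing
        while not environment.is_done():       # 3. wait for completion
            time.sleep(poll_interval)
        results = environment.fetch_results()  # 4. obtain the results
        exporter.export(results)               # 5. hand them to the exporter
        handled += 1
    return handled


exporter = CollectingExporter()
handled = run_workflow([["a", "b"], ["c"]], UppercaseEnvironment(), exporter)
print(handled, exporter.exported)
```

In the library itself this orchestration is the job of the TaskScheduler; the sketch only shows the order of the steps.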

As mentioned in the definition of a workflow, this Python library works with the following components/concepts:

TaskScheduler

Main process that handles all the steps described in the Definition of a workflow

StatusHandler

Keeps track of the workflow status, ensuring recovery after crashes. By default the status is persisted to a SQLite database file using the SQLiteStatusHandler, but other implementations can be made by subclassing StatusHandler.
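
To illustrate the idea of persisting status so that a workflow can resume after a crash, here is a minimal sketch built on Python's standard sqlite3 module. The class name SketchStatusHandler, the table layout and the status values are assumptions made for this example; they do not describe the schema or methods of the library's SQLiteStatusHandler.

```python
import sqlite3
from typing import List


class SketchStatusHandler:
    """Hypothetical status store; NOT the library's SQLiteStatusHandler."""

    def __init__(self, db_path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to survive a process restart.
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS item_status ("
            "item_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )
        self._conn.commit()

    def set_status(self, item_id: str, status: str) -> None:
        # Insert or overwrite, so each item has exactly one current status.
        self._conn.execute(
            "INSERT OR REPLACE INTO item_status (item_id, status) VALUES (?, ?)",
            (item_id, status),
        )
        self._conn.commit()

    def unfinished_items(self) -> List[str]:
        # After a crash, these are the items that still need attention.
        rows = self._conn.execute(
            "SELECT item_id FROM item_status "
            "WHERE status != 'done' ORDER BY item_id"
        ).fetchall()
        return [row[0] for row in rows]


handler = SketchStatusHandler()
handler.set_status("doc1", "submitted")
handler.set_status("doc2", "submitted")
handler.set_status("doc1", "done")
print(handler.unfinished_items())
```

Committing after every status change keeps the database consistent with the last completed step, which is what makes recovery possible.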

StatusMonitor

Note: This component is currently being implemented and is not yet available.

Runs on top of the StatusHandler database and visualises the overall progress of a workflow in a human-readable manner (e.g. showing the percentage of successfully processed and failed items).
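
Since the StatusMonitor is not available yet, the following is only a sketch of the kind of summary it could produce. The function name summarise_progress and the status labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarise_progress(statuses: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of items per status, rounded to one decimal."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing has been tracked yet
    return {status: round(100.0 * n / total, 1) for status, n in counts.items()}


print(summarise_progress(["done", "done", "failed", "processing"]))
```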

DataProvider

Iteratively called by the TaskScheduler to obtain a new batch of source data. No default implementations are available (yet), since there are many possible ways one would want to supply data to a system. Simply subclass from DataProvider to have full control over your input flow.
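
As an illustration of supplying data in batches, the sketch below reads rows from CSV text and hands them out a few at a time. The base class DataProviderSketch and its fetch_next_batch method are stand-ins invented for this example; check the library source for the actual abstract methods of DataProvider.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class DataProviderSketch(ABC):
    """Hypothetical stand-in for the library's DataProvider base class."""

    @abstractmethod
    def fetch_next_batch(self) -> Optional[List[Dict[str, str]]]:
        """Return the next batch, or None when the source is exhausted."""


class CsvDataProvider(DataProviderSketch):
    """Reads all rows from CSV text and serves them in fixed-size batches."""

    def __init__(self, csv_text: str, batch_size: int) -> None:
        self._rows = list(csv.DictReader(io.StringIO(csv_text)))
        self._batch_size = batch_size
        self._offset = 0

    def fetch_next_batch(self) -> Optional[List[Dict[str, str]]]:
        if self._offset >= len(self._rows):
            return None
        batch = self._rows[self._offset:self._offset + self._batch_size]
        self._offset += len(batch)
        return batch


# Example input; the column names and URLs are made up.
CSV_TEXT = "id,url\n1,http://example.org/a.mp4\n2,http://example.org/b.mp4\n3,http://example.org/c.mp4\n"
provider = CsvDataProvider(CSV_TEXT, batch_size=2)
while (batch := provider.fetch_next_batch()) is not None:
    print([row["id"] for row in batch])
```

The same pattern applies to a database or spreadsheet source: only the constructor, which loads the rows, would change.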

DataProcessingEnvironment

Iteratively called by the TaskScheduler to submit batches of data to an (external) processing environment. Also takes care of obtaining the output of finished processes from such an environment.

This library contains a full implementation, DANEEnvironment, for interacting with DANE environments, but other environments/APIs can be supported by subclassing DataProcessingEnvironment.
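
When wrapping another external API, it usually helps to bound how long the workflow waits for a batch. The sketch below shows submit, bounded polling and result retrieval against a fake environment. FakeEnvironment, its methods and process_batch are all hypothetical names, not part of DataProcessingEnvironment.

```python
import time
from typing import Dict, List


class FakeEnvironment:
    """Hypothetical environment that reports completion after a few polls."""

    def __init__(self, polls_needed: int = 3) -> None:
        self._polls_needed = polls_needed
        self._polls_seen = 0
        self._batch: List[str] = []

    def submit_batch(self, batch: List[str]) -> None:
        self._batch = list(batch)
        self._polls_seen = 0

    def is_batch_done(self) -> bool:
        self._polls_seen += 1
        return self._polls_seen >= self._polls_needed

    def fetch_results(self) -> Dict[str, str]:
        return {item: f"processed:{item}" for item in self._batch}


def process_batch(
    env: FakeEnvironment,
    batch: List[str],
    poll_interval: float = 0.01,
    max_polls: int = 100,
) -> Dict[str, str]:
    """Submit a batch, poll at most max_polls times, then fetch the results."""
    env.submit_batch(batch)
    for _ in range(max_polls):
        if env.is_batch_done():
            return env.fetch_results()
        time.sleep(poll_interval)
    raise TimeoutError("processing environment did not finish in time")


print(process_batch(FakeEnvironment(), ["a", "b"]))
```

Raising on timeout, rather than waiting forever, lets the caller record the failure and move on to the next batch.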

Exporter

Called by the TaskScheduler with output data from a processing environment. No default implementation is available (yet), since this is typically the most use-case sensitive part of any workflow, meaning you should decide what to do with the output data (by subclassing Exporter).
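
One possible shape for an exporter is sketched below: it reconciles results with the source items by ID and writes one JSON record per line. The names reconcile and JsonLinesExporter, the export signature and the "id" and "result" keys are assumptions for illustration and do not reflect the library's Exporter interface.

```python
import io
import json
from typing import Any, Dict, List, Optional, TextIO


def reconcile(
    source_items: List[Dict[str, Any]], results: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Attach each result to its source item, matching on the 'id' key."""
    merged = []
    for item in source_items:
        record = dict(item)  # copy, so the source data is left untouched
        result: Optional[Any] = results.get(item["id"])  # None if unprocessed
        record["result"] = result
        merged.append(record)
    return merged


class JsonLinesExporter:
    """Hypothetical exporter that writes one JSON object per line."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def export(
        self, source_items: List[Dict[str, Any]], results: Dict[str, Any]
    ) -> int:
        records = reconcile(source_items, results)
        for record in records:
            self._stream.write(json.dumps(record, sort_keys=True) + "\n")
        return len(records)


buffer = io.StringIO()
items = [{"id": "1", "title": "A"}, {"id": "2", "title": "B"}]
JsonLinesExporter(buffer).export(items, {"1": "hello"})
print(buffer.getvalue())
```

Writing to any text stream keeps the sketch easy to test; a real exporter might write to a file, a search index or a database instead.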

Getting started

Prerequisites

  • Python >= 3.8 <= 3.10
  • Poetry

Installation

Install the latest version from pypi.org, e.g. with:

pip install dane-workflows

Local development

Run poetry install. After completion run:

poetry shell

To test that the contents of this repository work well, run:

./scripts/check-project.sh

TODO finalise

Usage

After installing dane-workflows in your local environment, you can run an example workflow with:

python main.py

This example script uses config-example.yml to configure and run a workflow using the following implementations:

  • DataProvider: ExampleDataProvider (with two dummy input documents)
  • DataProcessingEnvironment: ExampleDataProcessingEnvironment (mocks processing environment)
  • StatusHandler: SQLiteStatusHandler (writes output to ./proc_stats/all_stats.db)
  • Exporter: ExampleExporter (does nothing with results)

To set up a workflow for your own purposes, consider the following:

What data do I want to process?

We've provided the ExampleDataProvider to easily feed a workflow with a couple of files (via config.yml). This is mostly for testing out your workflow.

Most likely you'll need to implement your own DataProvider by subclassing it. This way you can e.g. load your input data from a database, a spreadsheet or whatever else you need.

Which processing environment will I use?

Since this project was developed to interface with running DANE environments at the very least, we've provided DANEEnvironment as a default implementation of DataProcessingEnvironment.

In case you'd like to call any other tool for processing your data, you're required to implement a subclass of DataProcessingEnvironment.

What will I do with the output of the processing environment?

After your DataProcessingEnvironment has processed a batch of items from your DataProvider, the TaskScheduler hands over the output data to your subclass of Exporter.

Since this is the most use-case dependent part of any workflow, we do not provide any useful default implementation.

Note: ExampleExporter is only used as a placeholder for tests or dry runs.

Roadmap

  • [x] Implement more advanced recovery
  • [x] Add example workflows (refer in README)
  • [x] Finalise initial README
  • [ ] Add Python docstring

See the open issues for a full list of proposed features, known issues and user questions.

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Use the issue tracker for any questions concerning this repository.

Project Link: https://github.com/beeldengeluid/dane-workflows

Codemeta.json requirements: https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md

Owner

  • Name: Beeld & Geluid
  • Login: beeldengeluid
  • Kind: organization
  • Location: Netherlands

The Netherlands Institute for Sound and Vision (NISV)

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "codeRepository": "https://github.com/beeldengeluid/dane-workflows",
  "dateCreated": "2022-07-18",
  "issueTracker": "https://github.com/beeldengeluid/dane-workflows/issues",
  "name": "dane-workflows",
  "version": "0.9.0",
  "description": "Python library for setting up simple data processing workflows (using DANE)",
  "applicationCategory": "Multimedia processing",
  "developmentStatus": "wip",
  "funder": {
    "@type": "Organization",
    "name": "CLARIAH",
    "url": "https://www.clariah.nl"
  },
  "programmingLanguage": [
    "Python 3"
  ],
  "softwareRequirements": [
    "Python 3.10"
  ],
  "author": [
    {
      "@type": "Person",
      "@id": "https://github.com/jblom",
      "givenName": "Jaap",
      "familyName": "Blom",
      "affiliation": {
        "@type": "Organization",
        "name": "The Netherlands Institute for Sound and Vision"
      }
    },
    {
      "@type": "Person",
      "@id": "https://github.com/mwigham",
      "affiliation": {
        "@type": "Organization",
        "name": "The Netherlands Institute for Sound and Vision"
      }
    }
  ],
  "contributor": []
}

GitHub Events

Total
  • Member event: 1
Last Year
  • Member event: 1

Committers

Last synced: about 3 years ago

All Time
  • Total Commits: 149
  • Total Committers: 5
  • Avg Commits per committer: 29.8
  • Development Distribution Score (DDS): 0.383
Top Committers
Name                 Email           Commits
jblom                j****m@b****l   92
KleinRana            r****n@b****l   32
mwigham              3****m@u****m   13
Philo van Kemenade   p****e@b****l   11
Rana                 4****a@u****m   1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 10
  • Total pull requests: 18
  • Average time to close issues: 6 months
  • Average time to close pull requests: 6 days
  • Total issue authors: 5
  • Total pull request authors: 5
  • Average comments per issue: 3.0
  • Average comments per pull request: 0.56
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • phivk (3)
  • jblom (3)
  • mwigham (2)
  • Veldhoen (1)
  • jwassenaarbg (1)
Pull Request Authors
  • jblom (13)
  • KleinRana (3)
  • phivk (2)
  • mwigham (1)
  • dependabot[bot] (1)
Top Labels
Issue Labels
  • enhancement (5)
  • bug (3)
  • VisXP (1)
Pull Request Labels
  • enhancement (1)
  • dependencies (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 16 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
pypi.org: dane-workflows

Library providing batch upload & monitoring for (DANE) processing environments

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 16 Last month
Rankings
Dependent packages count: 6.6%
Average: 26.7%
Downloads: 26.9%
Forks count: 30.5%
Dependent repos count: 30.6%
Stargazers count: 39.1%
Maintainers (1)
Last synced: 11 months ago

Dependencies

poetry.lock pypi
  • atomicwrites 1.4.1 develop
  • attrs 21.4.0 develop
  • black 22.6.0 develop
  • click 8.1.3 develop
  • colorama 0.4.5 develop
  • coverage 6.4.2 develop
  • flake8 4.0.1 develop
  • iniconfig 1.1.1 develop
  • mccabe 0.6.1 develop
  • mockito 1.3.3 develop
  • mypy 0.961 develop
  • mypy-extensions 0.4.3 develop
  • packaging 21.3 develop
  • pathspec 0.9.0 develop
  • platformdirs 2.5.2 develop
  • pluggy 1.0.0 develop
  • py 1.11.0 develop
  • pycodestyle 2.8.0 develop
  • pyflakes 2.4.0 develop
  • pyparsing 3.0.9 develop
  • pytest 7.1.2 develop
  • pytest-cov 3.0.0 develop
  • tomli 2.0.1 develop
  • types-requests 2.28.1 develop
  • types-urllib3 1.26.16 develop
  • typing-extensions 4.3.0 develop
  • certifi 2022.6.15
  • charset-normalizer 2.1.0
  • dane 0.3.2
  • elasticsearch7 7.17.4
  • idna 3.3
  • pika 1.3.0
  • pyyaml 6.0
  • requests 2.28.1
  • types-pyyaml 6.0.10
  • urllib3 1.26.10
  • yacs 0.1.8
pyproject.toml pypi
  • black ^22.3.0 develop
  • flake8 ^4.0.1 develop
  • mockito ^1.3.1 develop
  • mypy ^0.961 develop
  • pytest ^7.1.2 develop
  • pytest-cov ^3.0.0 develop
  • types-requests ^2.27.31 develop
  • dane ^0.3.2
  • pika ^1.2.1
  • python ^3.10
  • requests ^2.28.0
  • types-PyYAML ^6.0.10
  • yacs ^0.1.8
.github/workflows/test-all-branches.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • snok/install-poetry v1 composite