https://github.com/bluebrain/data-validation-framework

Simple framework to create data validation workflows.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.7%) to scientific vocabulary

Keywords

data data-analysis python validation validation-tool
Last synced: 5 months ago

Repository

Simple framework to create data validation workflows.

Basic Info
Statistics
  • Stars: 7
  • Watchers: 4
  • Forks: 1
  • Open Issues: 0
  • Releases: 19
Topics
data data-analysis python validation validation-tool
Created about 4 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Authors

README.md


Data Validation Framework

This project provides simple tools to create data validation workflows. The workflows are based on the luigi library.

The main objective of this framework is to gather in the same place both the specifications that the data must follow and the code that actually tests the data. This avoids maintaining the specifications in separate documents and the test code in a separate repository.

Installation

This package should be installed using pip:

```bash
pip install data-validation-framework
```

Usage

Building a workflow

Building a new workflow is simple, as you can see in the following example:

```python
import luigi

import data_validation_framework as dvf


class ValidationTask1(dvf.task.ElementValidationTask):
    """Use the class docstring to describe the specifications of the ValidationTask1."""

    output_columns = {"col_name": None}

    @staticmethod
    def validation_function(row, output_path, *args, **kwargs):
        # Return the validation result for one row of the dataset
        if row["col_name"] <= 10:
            return dvf.result.ValidationResult(is_valid=True)
        else:
            return dvf.result.ValidationResult(
                is_valid=False,
                ret_code=1,
                comment="The value should always be <= 10",
            )


def external_validation_function(df, output_path, *args, **kwargs):
    # Update the dataset in place here by setting values in the 'is_valid' column.
    # The 'ret_code' and 'comment' values are optional; they will be added to the report
    # in order to help the user understand why the dataset did not pass the validation.

    # We can use the value from kwargs["param_value"] here.
    if int(kwargs["param_value"]) <= 10:
        df["is_valid"] = True
    else:
        df["is_valid"] = False
        df["ret_code"] = 1
        df["comment"] = "The value should always be <= 10"


class ValidationTask2(dvf.task.SetValidationTask):
    """In some cases you might want to keep the docstring to describe what a developer
    needs to know, not the end-user. In this case, you can use the __specifications__
    attribute to store the specifications."""

    a_parameter = luigi.Parameter()

    __specifications__ = """Use the __specifications__ to describe the specifications of the
    ValidationTask2."""

    def inputs(self):
        return {ValidationTask1: {"col_name": "new_col_name_in_current_task"}}

    def kwargs(self):
        return {"param_value": self.a_parameter}

    validation_function = staticmethod(external_validation_function)


class ValidationWorkflow(dvf.task.ValidationWorkflow):
    """Use the global workflow specifications to give general context to the end-user."""

    def inputs(self):
        return {
            ValidationTask1: {},
            ValidationTask2: {},
        }
```

Here, the ValidationWorkflow class only defines the sub-tasks that should be called for the validation. Each sub-task can be either a dvf.task.ElementValidationTask or a dvf.task.SetValidationTask. In both cases, you can define relations between these sub-tasks, since one may need the result of another to run properly. This is defined in two steps:

  1. In the required task, an output_columns attribute should be defined so that downstream tasks know what data is available, as shown in the previous example for ValidationTask1.
  2. In the task that requires another task, an inputs method should be defined, as shown in the previous example for ValidationTask2.

The sub-classes of dvf.task.ElementValidationTask should return a dvf.result.ValidationResult object. The sub-classes of dvf.task.SetValidationTask should return a pandas.DataFrame object with at least the columns ["is_valid", "ret_code", "comment", "exception"] and with the same index as the input dataset.
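To illustrate the set-level contract, here is a minimal sketch of a validation function that fills the required columns using plain pandas. The function name check_threshold and the col_name column are hypothetical, and the threshold mirrors the example above:

```python
import pandas as pd


def check_threshold(df, output_path, *args, **kwargs):
    # Mark each row valid when "col_name" is <= 10, updating the DataFrame in place.
    df["is_valid"] = df["col_name"] <= 10
    # "ret_code" and "comment" are only filled for the failing rows.
    df.loc[~df["is_valid"], "ret_code"] = 1
    df.loc[~df["is_valid"], "comment"] = "The value should always be <= 10"


df = pd.DataFrame({"col_name": [3, 42]})
check_threshold(df, None)
# df["is_valid"] is now [True, False]; row 1 also gets ret_code and comment.
```

The same mask-based pattern scales to any per-column check, since the report columns are ordinary DataFrame columns.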

Generate the specifications of a workflow

The specifications that the data should follow can be generated with the following luigi command:

```bash
luigi --module test_validation ValidationWorkflow --log-level INFO --local-scheduler --result-path out --ValidationTask2-a-parameter 15 --specifications-only
```

Running a workflow

The workflow can be run with the following luigi command (note that the module test_validation must be available in your sys.path):

```bash
luigi --module test_validation ValidationWorkflow --log-level INFO --local-scheduler --dataset-df dataset.csv --result-path out --ValidationTask2-a-parameter 15
```

This workflow will generate the following files:

  • out/report_ValidationWorkflow.pdf: the PDF validation report.
  • out/ValidationTask1/report.csv: The CSV containing the validity values of the task ValidationTask1.
  • out/ValidationTask2/report.csv: The CSV containing the validity values of the task ValidationTask2.
  • out/ValidationWorkflow/report.csv: The CSV containing the validity values of the complete workflow.
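The report files are plain CSV, so they can be inspected with pandas once a run finishes. A small sketch, where the inline CSV stands in for the contents of out/ValidationWorkflow/report.csv:

```python
import io

import pandas as pd

# Stand-in for out/ValidationWorkflow/report.csv; in practice pass the file path
# to pd.read_csv instead of this StringIO buffer.
report_csv = io.StringIO(
    "is_valid,ret_code,comment,exception\n"
    "True,,,\n"
    "False,1,The value should always be <= 10,\n"
)
report = pd.read_csv(report_csv)

# Keep only the rows that failed validation.
failed = report[~report["is_valid"]]
print(len(failed))  # one row failed in this stand-in data
```

Filtering on the "is_valid" column this way is a quick triage step before opening the full PDF report.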

Note: as with any luigi (https://luigi.readthedocs.io/en/stable) workflow, the parameter values can be stored in a luigi.cfg file instead of being passed on the CLI.
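As a sketch, the parameters from the commands above could be stored in a luigi.cfg file like this (the section and option names are assumed to mirror the --Section-option form used on the CLI):

```ini
[core]
log_level = INFO

[ValidationWorkflow]
result_path = out

[ValidationTask2]
a_parameter = 15
```

With this file in place, the luigi invocation only needs the module and task name.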

Advanced features

Require a regular Luigi task

In some cases, one may want to execute a regular Luigi task within a validation workflow. This is possible by overriding the extra_requires() method to declare these extra requirements. Inside the validation task, the targets of these extra requirements can then be retrieved with the extra_input() method.

```python
import luigi

from data_validation_framework import task
# Assumption: OutputLocalTarget comes from the luigi-tools package, a dependency
# of this framework.
from luigi_tools import target


class TestTaskA(luigi.Task):

    def run(self):
        # Do something and write the 'target.file'
        with self.output().open("w") as f:
            f.write("some data")

    def output(self):
        return target.OutputLocalTarget("target.file")


class TestTaskB(task.SetValidationTask):

    output_columns = {"extra_target_path": None}

    def kwargs(self):
        return {"extra_task_target_path": self.extra_input().path}

    def extra_requires(self):
        return TestTaskA()

    @staticmethod
    def validation_function(df, output_path, *args, **kwargs):
        df["is_valid"] = True
        df["extra_target_path"] = kwargs["extra_task_target_path"]
```

Funding & Acknowledgment

The development of this software was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.

For license and authors, see LICENSE.txt and AUTHORS.md respectively.

Copyright © 2022-2023 Blue Brain Project/EPFL

Owner

  • Name: The Blue Brain Project
  • Login: BlueBrain
  • Kind: organization
  • Email: bbp.opensource@epfl.ch
  • Location: Geneva, Switzerland

Open Source Software produced and used by the Blue Brain Project

GitHub Events

Total
  • Release event: 1
  • Delete event: 5
  • Issue comment event: 1
  • Push event: 8
  • Pull request event: 7
  • Fork event: 1
  • Create event: 5
Last Year
  • Release event: 1
  • Delete event: 5
  • Issue comment event: 1
  • Push event: 8
  • Pull request event: 7
  • Fork event: 1
  • Create event: 5

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 65
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 11 hours
  • Total issue authors: 1
  • Total pull request authors: 4
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.48
  • Merged pull requests: 63
  • Bot issues: 0
  • Bot pull requests: 10
Past Year
  • Issues: 0
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: about 16 hours
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.33
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 2
Top Authors
Issue Authors
  • adrien-berchet (1)
Pull Request Authors
  • adrien-berchet (58)
  • dependabot[bot] (15)
  • alex4200 (1)
  • bbpgithubaudit (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (15) github_actions (1)

Dependencies

.github/workflows/codeql.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/commitlint.yml actions
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
.github/workflows/publish-sdist.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/run-tox.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • awalsh128/cache-apt-pkgs-action latest composite
  • codecov/codecov-action v3 composite
  • mikepenz/action-junit-report v3 composite