dqmarc

Data quality profiling tool for tabular data.

https://github.com/christie-nhs-data-science/dqmarc

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.9%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Code of conduct Citation Support

README.md

DQMaRC: A Python Tool for Structured Data Quality Profiling

  • Version: {VERSION}
  • Authors: Anthony Lighterness and Michael Adcock
  • License: MIT License and Open Government License v3

Project Status: Active – The project has reached a stable, usable state and is being actively developed.


Overview

DQMaRC (Data Quality Markup and Ready-to-Connect) is a Python tool designed to facilitate comprehensive data quality profiling of structured tabular data. It allows data analysts, engineers, and scientists to systematically assess and manage the quality of their datasets across multiple dimensions including completeness, validity, uniqueness, timeliness, consistency, and accuracy.

DQMaRC can be used both programmatically within Python scripts and interactively through a Shiny web application front-end user interface, making it versatile for different use cases ranging from ad-hoc analysis to integration within larger data pipelines.

Key Features

  • Multi-dimensional Data Quality Checks: Evaluate datasets across key dimensions including Completeness, Validity, Uniqueness, Timeliness, Consistency, and Accuracy.
  • Customisable Test Parameters: Configure data quality test parameters via Python or a user-friendly spreadsheet to tailor the assessment to your dataset.
  • Interactive Shiny App: Set up, run, explore and visualise data quality issues interactively through a Shiny app for Python.
  • Integration with Data Pipelines: Easily integrate DQMaRC into your data processing pipelines for scheduled data quality checks.
  • Detailed Reporting: Generate comprehensive reports detailing data quality issues at both the cell and aggregate levels.
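
To make the cell-level versus aggregate-level distinction concrete, the sketch below computes two of the listed dimensions (completeness and uniqueness) with plain pandas. It does not call DQMaRC and is not a description of how DQMaRC implements its checks; the column names and toy values are invented for illustration.

```python
import pandas as pd

# Toy data; column names and values are invented for this illustration.
df = pd.DataFrame(
    {
        "patient_id": [1, 2, 2, 4],
        "gender": ["F", None, "M", "F"],
    }
)

# Cell level: True marks a cell that fails the check.
cell_flags = pd.DataFrame(
    {
        "completeness_gender": df["gender"].isna(),
        "uniqueness_patient_id": df["patient_id"].duplicated(keep=False),
    }
)

# Aggregate level: number and percentage of failing cells per check.
summary = pd.DataFrame(
    {
        "n_failed": cell_flags.sum(),
        "pct_failed": cell_flags.mean() * 100,
    }
)
print(summary)
```

The cell-level frame has the same number of rows as the input, so it can be placed side by side with the data; the aggregate frame has one row per check.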

Installation

Using Pip

You can install DQMaRC using pip or conda. Ensure you have a virtual environment activated.

```bash
pip install DQMaRC
```

Dependencies

The package dependencies are listed in the requirements.txt file and will be installed automatically during the installation of DQMaRC.
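
If you want to confirm what ended up in your environment, a small standard-library check along the following lines may help. It is only a sketch: the package list is copied from the requirements.txt entries shown in the dependency listing for this project, and the helper name `installed_versions` is made up for this example.

```python
from importlib import metadata

# Runtime dependencies as listed for requirements.txt (numpy is pinned to <2.0 there).
REQUIRED = [
    "ipydatagrid", "ipywidgets", "jupyterlab", "nbconvert", "nbformat",
    "numpy", "pandas", "plotly", "shiny",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```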

Getting Started

1. Import Libraries

Start by importing the necessary libraries and DQMaRC modules in your Python environment.

```python
import pandas as pd
from DQMaRC import DataQuality
```

2. Load Your Data

Load the dataset you wish to profile.

```python
# Load your data
df = pd.read_csv('path_to_your_data.csv')
```

3. Initialise DQMaRC and Set Test Parameters

Initialise the DQ tool and set the test parameters. You can generate a template or import predefined parameters.

```python
# Initialise the Data Quality object
dq = DataQuality(df)

# Generate test parameters template
test_params = dq.get_param_template()

# (Optional) Load pre-configured test parameters instead of the template
# test_params = pd.read_csv('path_to_test_parameters.csv')

# Set the test parameters
dq.set_test_params(test_params)
```
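
Because the parameters can also be maintained in a spreadsheet, one possible workflow is to export the parameter table to CSV, edit it, and read it back with pandas. The sketch below shows only that round trip. The layout of the real template is not described in this README, so the stand-in frame and its column names (`field`, `check_enabled`) are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the test parameters table; the real template's columns are
# not shown in this README, so these names are hypothetical.
test_params = pd.DataFrame(
    {
        "field": ["patient_id", "gender"],
        "check_enabled": [True, False],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test_parameters.csv"

    # Export without the index so the spreadsheet shows only real columns.
    test_params.to_csv(path, index=False)

    # ... edit the file in a spreadsheet, then load it back ...
    edited = pd.read_csv(path)

print(edited)
```

Writing with `index=False` keeps the file free of an extra unnamed index column when it is read back.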

4. Run Data Quality Checks

Run the data quality checks across all dimensions.

```python
dq.run_all_metrics()
```

5. Retrieve and Save Results

Retrieve the full results and join them with your original dataset for detailed analysis.

```python
# Get the full results
full_results = dq.raw_results()

# Join results with the original dataset
df_with_results = df.join(full_results, how="left")

# Save results to a CSV file
df_with_results.to_csv('path_to_save_results.csv', index=False)
```
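
The join step relies on pandas index alignment: `DataFrame.join` matches rows by index label, and `how="left"` keeps every row of the original data. The sketch below illustrates this with a stand-in results frame; the flag column name `completeness_gender` is hypothetical and is not taken from DQMaRC's actual output.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["F", None, "M"]}, index=[10, 11, 12])

# Stand-in for a cell-level results frame; the column name is hypothetical.
# It is deliberately built in a different row order to show index alignment.
full_results = pd.DataFrame(
    {"completeness_gender": [False, True, False]}, index=[12, 11, 10]
)

# join() aligns on the index; how="left" keeps every row of df.
df_with_results = df.join(full_results, how="left")

# Aggregate the cell-level flags into a per-check failure count.
failures = df_with_results.filter(like="completeness_").sum()

print(df_with_results)
print(failures)
```

Note that saving with `index=False` drops the index from the CSV, so if the index carries a meaningful record identifier you may want to call `reset_index()` before saving.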

Using the Shiny App

In addition to programmatic usage, DQMaRC includes an interactive Shiny web app for Python that allows users to explore and visualise data quality issues.

You can test the DQMaRC ShinyLive Demo by copying and pasting the URL located HERE into your web browser. This link takes you to a ShinyLive Editor where you can try out the DQMaRC functionality. If you encounter an error, try refreshing the page once or twice. If the error persists, please get in touch by contacting us or raising an issue on our repository.

PLEASE NOTE The ShinyLive UI is recommended only for testing and getting used to the DQMaRC tool functionality. This interface is deployed on your machine, meaning it is only as secure as your machine is. Data you upload is held in local memory and wiped when you exit the app.

Running the Shiny App

To run the Shiny app, use the following command in your terminal:

```bash
shiny run --reload --launch-browser path_to_your_app/app.py
```
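
If you would rather start the app from a Python script, for example as part of a larger tool, the same command can be assembled and launched with `subprocess`. This is a minimal sketch: `build_shiny_command` and `run_app` are helper names made up for this example, and it assumes the `shiny` executable is on your PATH.

```python
import subprocess


def build_shiny_command(app_path, reload=True, launch_browser=True):
    """Assemble the `shiny run` command as an argument list."""
    cmd = ["shiny", "run"]
    if reload:
        cmd.append("--reload")
    if launch_browser:
        cmd.append("--launch-browser")
    cmd.append(str(app_path))
    return cmd


def run_app(app_path):
    """Start the app and block until it exits (Ctrl+C to stop)."""
    return subprocess.run(build_shiny_command(app_path), check=True)


# Show the command without starting the server.
print(build_shiny_command("path_to_your_app/app.py"))
```

Passing the command as a list rather than a single string avoids shell quoting problems when the app path contains spaces.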

Deploying the Shiny App

For deploying the Shiny app on a server, follow the official Shiny for Python deployment guide.

Documentation

Comprehensive documentation for DQMaRC, including detailed API references and user guides, is available HERE or in the project docs/ directory.

Repo Structure

Top-level Structure

```

DQMaRC
│   requirements.txt    # package dependencies
│   setup.py            # setup configuration for the python package distribution
│
├───docs                # user docs material
│   │...
│
├───DQMaRC              # source code
│   │   Accuracy.py
│   │   app.py
│   │   Completeness.py
│   │   Consistency.py
│   │   DataQuality.py
│   │   Dimension.py
│   │   Timeliness.py
│   │   Uniqueness.py
│   │   UtilitiesDQMaRC.py
│   │   Validity.py
│   │   __init__.py
│   │
│   ├───data            # data used in the tutorial(s)
│   │   │   DQdffull.csv
│   │   │   testparamsdefinitions.csv
│   │   │   toydfsubset.csv
│   │   │   toydfsubsettestparams24.05.16.csv
│   │   │
│   │   └───lookups     # data standards and or value lists for data validity checks
│   │           LUtoydfgender.csv
│   │           LUtoydfICD10v5.csv
│   │           LUtoydfMstage.csv
│   │           LUtoydftumourstage.csv
│   │
│   ├───notebooks
│   │       Backend_Tutorial.ipynb    # Tutorial for python users
│...

```

Contributing

Contributions to DQMaRC are welcome! Please read the CONTRIBUTING.md file for guidelines on how to contribute to this project.

License

DQMaRC is licensed under the MIT License and the Open Government License v3. See the LICENSE file for more details.

Acknowledgments

This project was developed by Anthony Lighterness and Michael Adcock. Special thanks to all contributors and testers who helped in the development of this tool.

Citation

Please use the following citation if you use DQMaRC:

Lighterness, A., Adcock, M.A., and Price, G. (2024). DQMaRC: A Python Tool for Structured Data Quality Profiling (Version 1.0.0) [Software]. Available from https://github.com/christie-nhs-data-science/DQMaRC.

Notice on Maintenance and Support

Please Note: This library is an open-source project maintained by a small team of contributors. While we strive to keep the package updated and well-maintained, ongoing support and development may vary depending on resource availability.

We strongly encourage users to engage with the project by reporting any issues, errors, or suggestions for improvements. Your feedback is invaluable in helping us identify and prioritise areas for improvement. Please feel free to submit questions, bug reports, or feature requests via our GitHub issues page or by reaching out.

Thank you for your understanding and for contributing to the growth and improvement of this project!


For more information, please visit the project repository

Owner

  • Name: The Christie NHS Data Science
  • Login: christie-nhs-data-science
  • Kind: organization
  • Email: business.intelligence@christie.nhs.uk
  • Location: The Christie Hospital, Wilmslow Rd, Manchester

Software by The Christie NHS Foundation Trust's Data Science Team

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Lighterness"
  given-names: "Anthony"
  orcid: "https://orcid.org/0000-0001-6898-6265"
- family-names: "Adcock"
  given-names: "Michael Thomas"
  orcid: "https://orcid.org/0009-0009-5111-380X"
- family-names: "Scanlon"
  given-names: "Lauren Abiqail"
  orcid: "https://orcid.org/0000-0001-7380-7145"
- family-names: "Price"
  given-names: "Gareth"
  orcid: "https://orcid.org/0000-0003-4353-3360"
title: "DQMaRC: Data Quality Markup and Ready-to-Connect"
version: 1.0.0
doi: [zenodo doi]
date-released: 2024-10-10
url: "https://github.com/christie-nhs-data-science/DQMaRC"

GitHub Events

Total
  • Release event: 1
  • Watch event: 1
  • Public event: 1
  • Push event: 192
  • Pull request event: 2
  • Fork event: 1
  • Create event: 2
Last Year
  • Release event: 1
  • Watch event: 1
  • Public event: 1
  • Push event: 192
  • Pull request event: 2
  • Fork event: 1
  • Create event: 2

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 29 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
pypi.org: dqmarc

A Python Tool for Structured Data Quality Profiling

  • Homepage: https://github.com/christie-nhs-data-science/DQMaRC
  • Documentation: https://christie-nhs-data-science.github.io/DQMaRC/
  • License: Open Government License v3 (https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
  • Latest release: 1.0.4
    published over 1 year ago
  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 29 Last month
Rankings
Dependent packages count: 10.1%
Average: 33.5%
Dependent repos count: 56.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/sphinx-build.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
docs/docs_requirements.txt pypi
  • ghp-import *
  • ipydatagrid *
  • ipywidgets *
  • jupyter_sphinx *
  • jupyterlab *
  • nbconvert *
  • nbformat *
  • nbsphinx *
  • numpy <2.0
  • pandas *
  • plotly *
  • pytest *
  • shiny *
  • shinywidgets *
  • sphinx *
  • sphinx-autodoc-typehints *
  • sphinx-book-theme *
  • sphinx-click *
  • sphinx-copybutton *
  • sphinx-tabs *
  • sphinx_rtd_theme *
pyproject.toml pypi
  • ipydatagrid *
  • ipywidgets *
  • jupyterlab *
  • nbconvert *
  • nbformat *
  • numpy <2.0
  • pandas *
  • plotly *
  • shiny *
requirements.txt pypi
  • ipydatagrid *
  • ipywidgets *
  • jupyterlab *
  • nbconvert *
  • nbformat *
  • numpy <2.0
  • pandas *
  • plotly *
  • shiny *
.github/workflows/publish_pypi.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/unit_tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
setup.py pypi