https://github.com/awslabs/python-deequ

Python API for Deequ

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary

Keywords from Contributors

labels
Last synced: 10 months ago

Repository

Python API for Deequ

Basic Info
  • Host: GitHub
  • Owner: awslabs
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage:
  • Size: 3.28 MB
Statistics
  • Stars: 789
  • Watchers: 16
  • Forks: 144
  • Open Issues: 121
  • Releases: 8
Created over 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Code of conduct

README.md

PyDeequ

PyDeequ is a Python API for Deequ, a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. PyDeequ is written to support usage of Deequ in Python.

Deequ has four main components:

  • Metrics Computation:
    • Profiles leverage Analyzers to analyze each column of a dataset.
    • Analyzers serve as a foundational module that computes metrics for data profiling and validation at scale.
  • Constraint Suggestion:
    • Specify rules for various groups of Analyzers to be run over a dataset, returning a collection of suggested constraints to run in a Verification Suite.
  • Constraint Verification:
    • Perform data validation on a dataset with respect to constraints that you set.
  • Metrics Repository:
    • Allows for persistence and tracking of Deequ runs over time.

🎉 Announcements 🎉

  • NEW!!! The 1.4.0 release of Python Deequ has been published to PyPI: https://pypi.org/project/pydeequ/. This release adds support for Spark 3.5.0.
  • The latest version of Deequ, 2.0.7, is available with Python Deequ 1.3.0.
  • The 1.1.0 release of Python Deequ has been published to PyPI: https://pypi.org/project/pydeequ/. This release brings many recent upgrades, including support up to Spark 3.3.0! Any feedback is welcome through GitHub issues.
  • With PyDeequ v0.1.8+, we now officially support Spark 3! Just make sure you set the environment variable SPARK_VERSION to specify your Spark version.
  • We've released a blog post on integrating PyDeequ with AWS, leveraging services such as AWS Glue, Athena, and SageMaker! Check it out: Monitor data quality in your data lake using PyDeequ and AWS Glue.
  • Check out the PyDeequ Release Announcement blog post, with a tutorial that walks through the Amazon Reviews dataset!
  • Join the PyDeequ community on PyDeequ Slack to chat with the devs!
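
The SPARK_VERSION requirement mentioned in the announcements can also be handled from Python. The following is a minimal sketch, assuming PyDeequ reads the variable when it is imported; `ensure_spark_version` is a hypothetical helper and the default `"3.3"` is only an illustrative value, so substitute the Spark version you actually run.

```python
import os


def ensure_spark_version(default: str = "3.3") -> str:
    """Set SPARK_VERSION if it is not already defined and return its value.

    Hypothetical helper: an existing value in the environment always wins,
    so a version exported in the shell is never overwritten.
    """
    return os.environ.setdefault("SPARK_VERSION", default)


spark_version = ensure_spark_version()
print(f"SPARK_VERSION={spark_version}")

# Import PyDeequ only after the variable is set (assumption: it is read at import time).
# import pydeequ
```

Exporting the variable in your shell or job configuration works just as well; the helper only covers the case where the variable is missing.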

Quickstart

The following sections walk through some basic usage. For more in-depth examples, take a look in the tutorials/ directory for executable Jupyter notebooks of each module. For documentation on supported interfaces, view the documentation.

Installation

You can install PyDeequ via pip.

pip install pydeequ

Set up a PySpark session

```python
from pyspark.sql import SparkSession, Row
import pydeequ

# Pull in the Deequ jar that matches your Spark version
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

# Small example DataFrame used in the sections below
df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),
    Row(a="bar", b=2, c=6),
    Row(a="baz", b=3, c=None)]).toDF()
```

Analyzers

```python
from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
```

Profile

```python
from pydeequ.profiles import *

result = ColumnProfilerRunner(spark) \
    .onData(df) \
    .run()

for col, profile in result.profiles.items():
    print(profile)
```

Constraint Suggestions

```python
from pydeequ.suggestions import *

suggestionResult = ConstraintSuggestionRunner(spark) \
    .onData(df) \
    .addConstraintRule(DEFAULT()) \
    .run()

# Constraint Suggestions in JSON format
print(suggestionResult)
```
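
The suggestion result can be easier to read as one line per suggestion. The following is a minimal sketch, assuming `suggestionResult` is a dictionary with a `constraint_suggestions` list whose entries carry `column_name`, `description`, and `code_for_constraint` keys; those key names are assumptions, so check them against your own output. `summarize_suggestions` is a hypothetical helper and `sample` is made-up data standing in for a real result.

```python
from typing import Any, Dict, List


def summarize_suggestions(suggestion_result: Dict[str, Any]) -> List[str]:
    """Format each suggested constraint as 'column: description -> code'."""
    lines = []
    # Assumed layout: {"constraint_suggestions": [{...}, ...]}
    for s in suggestion_result.get("constraint_suggestions", []):
        lines.append(
            f"{s['column_name']}: {s['description']} -> {s['code_for_constraint']}"
        )
    return lines


# Hypothetical sample in the assumed shape; replace with suggestionResult.
sample = {
    "constraint_suggestions": [
        {
            "column_name": "b",
            "description": "'b' is not null",
            "code_for_constraint": '.isComplete("b")',
        }
    ]
}

for line in summarize_suggestions(sample):
    print(line)
```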

Constraint Verification

```python
from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x >= 3) \
        .hasMin("b", lambda x: x == 0) \
        .isComplete("c") \
        .isUnique("a") \
        .isContainedIn("a", ["foo", "bar", "baz"]) \
        .isNonNegative("b")) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
```
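
To act on the verification output rather than only display it, the rows can be filtered for constraints that did not pass. The following is a minimal sketch, assuming the DataFrame returned by `VerificationResult.checkResultsAsDataFrame` has `constraint`, `constraint_status`, and `constraint_message` columns and that a passing constraint has the status `"Success"`. `failed_constraints` is a hypothetical helper, and `rows` is made-up data standing in for the collected DataFrame rows converted with `asDict()`.

```python
from typing import Dict, Iterable, List


def failed_constraints(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Return one message per constraint whose status is not 'Success'."""
    return [
        f"{row['constraint']}: {row.get('constraint_message', '')}"
        for row in rows
        if row["constraint_status"] != "Success"
    ]


# Hypothetical rows in the assumed column layout.
rows = [
    {
        "constraint": "SizeConstraint(Size(None))",
        "constraint_status": "Success",
        "constraint_message": "",
    },
    {
        "constraint": "MinimumConstraint(Minimum(b,None))",
        "constraint_status": "Failure",
        "constraint_message": "Value: 1.0 does not meet the constraint requirement!",
    },
]

failures = failed_constraints(rows)
if failures:
    print("Data quality issues:", *failures, sep="\n  ")
```

In a pipeline, a non-empty `failures` list could be used to stop the job or raise an alert, depending on the check level you chose.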

Repository

Save to a Metrics Repository by adding the useRepository() and saveOrAppendResult() calls to your Analysis Runner.

```python
from pydeequ.repository import *
from pydeequ.analyzers import *

metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
repository = FileSystemMetricsRepository(spark, metrics_file)
key_tags = {'tag': 'pydeequ hello world'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(ApproxCountDistinct('b')) \
    .useRepository(repository) \
    .saveOrAppendResult(resultKey) \
    .run()
```

To load previous runs, use the repository object to load previous results back in.

```python
result_metrep_df = repository.load() \
    .before(ResultKey.current_milli_time()) \
    .forAnalyzers([ApproxCountDistinct('b')]) \
    .getSuccessMetricsAsDataFrame()
```

Wrapping up

After you've run your jobs with PyDeequ, be sure to shut down your Spark session to prevent any hanging processes.

```python
spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()
```
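
To make sure the shutdown runs even when a job raises an exception, the two calls can be wrapped in a context manager. This is a minimal sketch and not part of PyDeequ; `closing_spark` is a hypothetical helper that accepts any Spark session object.

```python
from contextlib import contextmanager


@contextmanager
def closing_spark(spark):
    """Yield the session and always shut it down afterwards."""
    try:
        yield spark
    finally:
        # Same two calls as in the snippet above, now guaranteed to run.
        spark.sparkContext._gateway.shutdown_callback_server()
        spark.stop()
```

Usage would look like `with closing_spark(spark) as s: ...`, with the PyDeequ runs placed inside the `with` block.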

Contributing

Please refer to the contributing doc for how to contribute to PyDeequ.

License

This library is licensed under the Apache 2.0 License.


Contributing Developer Setup

  1. Setup SDKMAN
  2. Setup Java
  3. Setup Apache Spark
  4. Install Poetry
  5. Run tests locally

Setup SDKMAN

SDKMAN! is a tool for managing parallel versions of multiple Software Development Kits on any Unix-based system. It provides a convenient command-line interface for installing, switching, removing, and listing candidates. SDKMAN! installs smoothly on Mac OSX, Linux, WSL, Cygwin, and similar environments, and supports the Bash and ZSH shells. See the documentation on the SDKMAN! website.

Open your favourite terminal and enter the following:

```bash
$ curl -s https://get.sdkman.io | bash
```

If the environment needs tweaking for SDKMAN to be installed, the installer will prompt you accordingly and ask you to restart.

Next, open a new terminal or enter:

```bash
$ source "$HOME/.sdkman/bin/sdkman-init.sh"
```

Lastly, run the following code snippet to ensure that installation succeeded:

```bash
$ sdk version
```

Setup Java

Install Java. Open your favourite terminal and enter the following:

```bash
# List the AdoptOpenJDK OpenJDK versions
$ sdk list java

# To install Java 11
$ sdk install java 11.0.10.hs-adpt

# To install Java 8
$ sdk install java 8.0.292.hs-adpt
```

Setup Apache Spark

Install Apache Spark. Open your favourite terminal and enter the following:

```bash
# List the Apache Spark versions
$ sdk list spark

# To install Spark 3
$ sdk install spark 3.0.2
```

Poetry

Poetry Commands

```bash
poetry install

poetry update

# --tree: List the dependencies as a tree.
# --latest (-l): Show the latest version.
# --outdated (-o): Show the latest version but only for packages that are outdated.
poetry show -o
```

Running Tests Locally

Take a look at the tests in tests/dataquality and tests/jobs.

```bash
$ poetry run pytest
```

Running Tests Locally (Docker)

If you have issues installing the dependencies listed above, another way to run the tests and verify your changes is through Docker. There is a Dockerfile that will install the required dependencies and run the tests in a container.

docker build . -t spark-3.3-docker-test
docker run spark-3.3-docker-test

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Issues event: 4
  • Watch event: 72
  • Issue comment event: 6
  • Push event: 2
  • Pull request review event: 2
  • Pull request event: 9
  • Fork event: 11
  • Create event: 2
Last Year
  • Issues event: 4
  • Watch event: 72
  • Issue comment event: 6
  • Push event: 2
  • Pull request review event: 2
  • Pull request event: 9
  • Fork event: 11
  • Create event: 2

Committers

Last synced: about 3 years ago

All Time
  • Total Commits: 43
  • Total Committers: 11
  • Avg Commits per committer: 3.909
  • Development Distribution Score (DDS): 0.651
Past Year
  • Commits: 6
  • Committers: 4
  • Avg Commits per committer: 1.5
  • Development Distribution Score (DDS): 0.5
Top Committers
Name Email Commits
Calvin Wang c****8@g****m 15
Calvin Wang c****n@a****m 14
cghyzel c****l@a****m 4
Lucas Cardozo l****o@g****m 3
Serge Smertin 2****x@u****m 1
Joan Aoanan 4****6@u****m 1
rdsharma26 6****6@u****m 1
Yusup y****p@l****m 1
ChethanUK c****1@g****m 1
MOHACGCG 6****G@u****m 1
dependabot[bot] 4****]@u****m 1
Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 146
  • Total pull requests: 84
  • Average time to close issues: 6 months
  • Average time to close pull requests: 4 months
  • Total issue authors: 110
  • Total pull request authors: 31
  • Average comments per issue: 2.49
  • Average comments per pull request: 1.18
  • Merged pull requests: 41
  • Bot issues: 0
  • Bot pull requests: 16
Past Year
  • Issues: 7
  • Pull requests: 16
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Issue authors: 5
  • Pull request authors: 6
  • Average comments per issue: 0.43
  • Average comments per pull request: 0.19
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • chenliu0831 (4)
  • Sankeernalk (4)
  • evoisec (3)
  • komashk (3)
  • nikie (3)
  • ankiiitraj (2)
  • dineshvelmuruga (2)
  • dilkushpatel (2)
  • vandanavk (2)
  • sbbagal13 (2)
  • SadamAyubNBS (2)
  • thvasilo (2)
  • ml6cz (2)
  • WiktorMadejski (2)
  • poolis (2)
Pull Request Authors
  • dependabot[bot] (25)
  • rdsharma26 (19)
  • chenliu0831 (10)
  • komashk (8)
  • nikie (7)
  • poolis (4)
  • chethanuk (3)
  • lecardozo (3)
  • iWantToKeepAnon (2)
  • WiktorMadejski (2)
  • stevenayers (2)
  • anqini (2)
  • gucciwang (2)
  • rjurney (1)
  • ghost (1)
Top Labels
Issue Labels
bug (13) feature request (10) question (9) enhancement (6) blocked (4) help wanted (3) environment (3) duplicate (3) researching (3) documentation (3) good first issue (3) release (1) dependencies (1) Deequ (1)
Pull Request Labels
dependencies (25) blocked (2) enhancement (1)

Packages

  • Total packages: 5
  • Total downloads:
    • pypi 14,570,244 last-month
  • Total docker downloads: 6,732,908
  • Total dependent packages: 8
    (may contain duplicates)
  • Total dependent repositories: 53
    (may contain duplicates)
  • Total versions: 25
  • Total maintainers: 3
pypi.org: pydeequ

PyDeequ - Unit Tests for Data

  • Versions: 14
  • Dependent Packages: 8
  • Dependent Repositories: 53
  • Downloads: 14,570,200 Last month
  • Docker Downloads: 6,732,908
Rankings
Downloads: 0.1%
Docker downloads count: 0.7%
Dependent packages count: 1.6%
Average: 1.9%
Dependent repos count: 2.0%
Stargazers count: 2.6%
Forks count: 4.1%
Maintainers (2)
Last synced: over 1 year ago
pypi.org: pydeequ-alb

PyDeequ - Unit Tests for Data

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Stargazers count: 2.9%
Forks count: 4.6%
Average: 4.7%
Dependent packages count: 4.8%
Dependent repos count: 6.3%
Last synced: over 1 year ago
proxy.golang.org: github.com/awslabs/python-deequ
  • Versions: 7
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.7%
Average: 5.9%
Dependent repos count: 6.0%
Last synced: 10 months ago
pypi.org: pydeequ2

PyDeequ2 - aws clone

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 44 Last month
Rankings
Stargazers count: 2.8%
Forks count: 4.3%
Dependent packages count: 6.6%
Average: 13.2%
Downloads: 21.5%
Dependent repos count: 30.6%
Maintainers (1)
Last synced: 10 months ago
conda-forge.org: pydeequ

PyDeequ is a Python API for Deequ, a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. PyDeequ is written to support usage of Deequ in Python.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Forks count: 17.1%
Stargazers count: 17.1%
Average: 29.8%
Dependent repos count: 34.0%
Dependent packages count: 51.2%
Last synced: 10 months ago

Dependencies

docs/source/requirements.txt pypi
  • pandas *
  • pyspark *
  • recommonmark *
poetry.lock pypi
  • appdirs 1.4.4 develop
  • atomicwrites 1.4.0 develop
  • attrs 21.2.0 develop
  • black 21.5b1 develop
  • bleach 3.3.0 develop
  • certifi 2020.12.5 develop
  • cffi 1.14.5 develop
  • cfgv 3.3.0 develop
  • chardet 4.0.0 develop
  • click 8.0.1 develop
  • colorama 0.4.4 develop
  • coverage 5.5 develop
  • cryptography 3.4.7 develop
  • dataclasses 0.8 develop
  • distlib 0.3.1 develop
  • docutils 0.17.1 develop
  • dparse 0.5.1 develop
  • filelock 3.0.12 develop
  • flake8 3.9.2 develop
  • flake8-docstrings 1.6.0 develop
  • identify 2.2.5 develop
  • idna 2.10 develop
  • importlib-metadata 4.0.1 develop
  • importlib-resources 5.1.4 develop
  • iniconfig 1.1.1 develop
  • jeepney 0.6.0 develop
  • keyring 23.0.1 develop
  • mccabe 0.6.1 develop
  • mypy-extensions 0.4.3 develop
  • nodeenv 1.6.0 develop
  • packaging 20.9 develop
  • pathspec 0.8.1 develop
  • pkginfo 1.7.0 develop
  • pluggy 0.13.1 develop
  • pre-commit 2.13.0 develop
  • py 1.10.0 develop
  • pycodestyle 2.7.0 develop
  • pycparser 2.20 develop
  • pydocstyle 6.1.1 develop
  • pyflakes 2.3.1 develop
  • pygments 2.9.0 develop
  • pyparsing 2.4.7 develop
  • pytest 6.2.4 develop
  • pytest-cov 2.12.0 develop
  • pytest-flake8 1.0.7 develop
  • pytest-rerunfailures 9.1.1 develop
  • pytest-runner 5.3.1 develop
  • pywin32-ctypes 0.2.0 develop
  • pyyaml 5.4.1 develop
  • readme-renderer 29.0 develop
  • regex 2021.4.4 develop
  • requests 2.25.1 develop
  • requests-toolbelt 0.9.1 develop
  • rfc3986 1.5.0 develop
  • safety 1.10.3 develop
  • secretstorage 3.3.1 develop
  • snowballstemmer 2.1.0 develop
  • toml 0.10.2 develop
  • tqdm 4.60.0 develop
  • twine 3.4.1 develop
  • typed-ast 1.4.3 develop
  • typing-extensions 3.10.0.0 develop
  • urllib3 1.26.4 develop
  • virtualenv 20.4.6 develop
  • webencodings 0.5.1 develop
  • zipp 3.4.1 develop
  • numpy 1.19.5
  • pandas 1.1.5
  • py4j 0.10.9
  • pyspark 3.0.2
  • python-dateutil 2.8.1
  • pytz 2021.1
  • six 1.16.0
pyproject.toml pypi
  • black ^21.5b1 develop
  • coverage ^5.5 develop
  • flake8 ^3.9.2 develop
  • flake8-docstrings ^1.6.0 develop
  • pre-commit ^2.12.1 develop
  • pytest ^6.2.4 develop
  • pytest-cov ^2.11.1 develop
  • pytest-flake8 ^1.0.7 develop
  • pytest-rerunfailures ^9.1.1 develop
  • pytest-runner ^5.3.0 develop
  • safety ^1.10.3 develop
  • twine ^3.4.1 develop
  • numpy >=1.14.1
  • pandas >=0.23.0
  • pyspark >=2.4.7, <3.1.1
  • python >=3.6.2,<4
.github/workflows/base.yml actions
  • actions/checkout v3 composite
  • actions/setup-java v1 composite
  • actions/setup-python v2 composite