https://github.com/awslabs/deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 2 of 75 committers (2.7%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.6%) to scientific vocabulary

Keywords

dataquality scala spark unit-testing

Keywords from Contributors

imputation missing-value-handling mlops data-engineering data-profilers workflow-orchestration workflow-engine scheduler orchestration etl

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: awslabs
License: apache-2.0
Language: Scala
Default Branch: master
Homepage:
Size: 69.4 MB

Statistics

Stars: 3,490
Watchers: 76
Forks: 566
Open Issues: 161
Releases: 29

Topics

dataquality scala spark unit-testing

Created over 7 years ago · Last pushed 6 months ago

Metadata Files

Readme Contributing License Code of conduct

Deequ - Unit Tests for Data

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. We are happy to receive feedback and contributions.

Python users may also be interested in PyDeequ, a Python interface for Deequ. You can find PyDeequ on GitHub, readthedocs, and PyPI.

Requirements and Installation

Deequ depends on Java 8. Deequ version 2.x only runs with Spark 3.1, and vice versa. If you rely on an earlier Spark version, please use a Deequ 1.x version (the legacy version is maintained in the legacy-spark-3.0 branch). We provide legacy releases compatible with Apache Spark versions 2.2.x to 3.0.x. The Spark 2.2.x and 2.3.x releases depend on Scala 2.11, and the Spark 2.4.x, 3.0.x, and 3.1.x releases depend on Scala 2.12.

Deequ is available via Maven Central.

Choose the latest release that matches your Spark version from the available versions. Add the release as a dependency to your project. For example, for Spark 3.1.x:

Maven

    <dependency>
      <groupId>com.amazon.deequ</groupId>
      <artifactId>deequ</artifactId>
      <version>2.0.0-spark-3.1</version>
    </dependency>

sbt

    libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1"

Example

Deequ's purpose is to "unit-test" data to find errors early, before the data gets fed to consuming systems or machine learning algorithms. In the following, we will walk you through a toy example to showcase the most basic usage of our library. An executable version of the example is available here.

Deequ works on tabular data, e.g., CSV files, database tables, logs, or flattened JSON files: basically anything that you can fit into a Spark DataFrame. For this example, we assume that we work on some kind of Item data, where every item has an id, a productName, a description, a priority, and a count of how often it has been viewed.

    case class Item(
      id: Long,
      productName: String,
      description: String,
      priority: String,
      numViews: Long
    )

Our library is built on Apache Spark and is designed to work with very large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse. For the sake of simplicity in this example, we just generate a few toy records though.

```scala
val rdd = spark.sparkContext.parallelize(Seq(
  Item(1, "Thingy A", "awesome thing.", "high", 0),
  Item(2, "Thingy B", "available at http://thingb.com", null, 0),
  Item(3, null, null, "low", 5),
  Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
  Item(5, "Thingy E", null, "high", 12)))

val data = spark.createDataFrame(rdd)
```

Most applications that work with data have implicit assumptions about that data, e.g., that attributes have certain types, do not contain NULL values, and so on. If these assumptions are violated, your application might crash or produce wrong outputs. The idea behind deequ is to explicitly state these assumptions in the form of a "unit-test" for data, which can be verified on a piece of data at hand. If the data has errors, we can "quarantine" and fix it, before we feed it to an application.

The main entry point for defining how you expect your data to look is the VerificationSuite from which you can add Checks that define constraints on attributes of the data. In this example, we test for the following properties of our data:

there are 5 rows in total
values of the id attribute are never NULL and unique
values of the productName attribute are never NULL
the priority attribute can only contain "high" or "low" as value
numViews should not contain negative values
at least half of the values in description should contain a url
the median of numViews should be less than or equal to 10

In code this looks as follows:

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(
    Check(CheckLevel.Error, "unit testing my data")
      .hasSize(_ == 5) // we expect 5 rows
      .isComplete("id") // should never be NULL
      .isUnique("id") // should not contain duplicates
      .isComplete("productName") // should never be NULL
      // should only contain the values "high" and "low"
      .isContainedIn("priority", Array("high", "low"))
      .isNonNegative("numViews") // should not contain negative values
      // at least half of the descriptions should contain a url
      .containsURL("description", _ >= 0.5)
      // the (approximate) median of numViews should be at most 10
      .hasApproxQuantile("numViews", 0.5, _ <= 10))
  .run()
```

After calling run, deequ translates your test to a series of Spark jobs, which it executes to compute metrics on the data. Afterwards it invokes your assertion functions (e.g., _ == 5 for the size check) on these metrics to see if the constraints hold on the data. We can inspect the VerificationResult to see if the test found errors:

```scala
import com.amazon.deequ.constraints.ConstraintStatus

if (verificationResult.status == CheckStatus.Success) {
  println("The data passed the test, everything is fine!")
} else {
  println("We found errors in the data:\n")

  val resultsForAllConstraints = verificationResult.checkResults
    .flatMap { case (_, checkResult) => checkResult.constraintResults }

  resultsForAllConstraints
    .filter { _.status != ConstraintStatus.Success }
    .foreach { result => println(s"${result.constraint}: ${result.message.get}") }
}
```

If we run the example, we get the following output:

```
We found errors in the data:

CompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement!
PatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement!
```

The test found that our assumptions are violated! Only 4 out of 5 (80%) of the values of the `productName` attribute are non-null and only 2 out of 5 (40%) values of the `description` attribute contain a URL. Fortunately, we ran a test and found the errors; somebody should immediately fix the data :)
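The two-step evaluation described above (first compute a metric, then apply an assertion function to it) can be illustrated without Spark. The following is a minimal plain-Scala sketch, not Deequ's implementation: the names `fraction`, `looksLikeUrl`, and `holds` are hypothetical, and the URL test is a deliberately simplified assumption rather than the pattern Deequ uses. It reproduces the two values reported in the output above.

```scala
object MetricThenAssertion {
  // Same shape as the Item records in the example above.
  case class Item(id: Long, productName: String, description: String,
                  priority: String, numViews: Long)

  val items: Seq[Item] = Seq(
    Item(1, "Thingy A", "awesome thing.", "high", 0),
    Item(2, "Thingy B", "available at http://thingb.com", null, 0),
    Item(3, null, null, "low", 5),
    Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
    Item(5, "Thingy E", null, "high", 12))

  // Step 1: a metric reduces a column to a single number.
  // (Hypothetical helper: fraction of values satisfying a predicate.)
  def fraction[A](values: Seq[A])(predicate: A => Boolean): Double =
    values.count(predicate).toDouble / values.size

  // Simplified URL test, an assumption made for this sketch only.
  def looksLikeUrl(s: String): Boolean =
    s != null && (s.contains("http://") || s.contains("https://"))

  val productNameCompleteness: Double =
    fraction(items.map(_.productName))(_ != null)
  val descriptionUrlFraction: Double =
    fraction(items.map(_.description))(looksLikeUrl)

  // Step 2: an assertion function decides whether the metric is acceptable.
  def holds(metric: Double, assertion: Double => Boolean): Boolean =
    assertion(metric)

  def main(args: Array[String]): Unit = {
    println(s"completeness(productName) = $productNameCompleteness") // 0.8
    println(s"urlFraction(description) = $descriptionUrlFraction")   // 0.4
    println(s"complete? ${holds(productNameCompleteness, _ == 1.0)}") // false
    println(s"enough URLs? ${holds(descriptionUrlFraction, _ >= 0.5)}") // false
  }
}
```

Keeping the metric separate from the assertion means the same computed number can be checked against different thresholds without rescanning the data.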

More examples

Our library contains much more functionality than what we showed in the basic example. We are in the process of adding more examples for its advanced features. So far, we showcase the following functionality:

Persistence and querying of computed metrics of the data with a MetricsRepository
Data profiling of large data sets
Anomaly detection on data quality metrics over time
Automatic suggestion of constraints for large datasets
Incremental metrics computation on growing data and metric updates on partitioned data (advanced)
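To give a rough idea of what anomaly detection on a stored metric history involves, here is a plain-Scala sketch. It does not use Deequ's API: `MetricPoint`, `isAnomalous`, and `maxRateIncrease` are hypothetical names, and the rule shown (flag a value that grew by more than a given factor since the previous run) is just one simple strategy chosen for illustration.

```scala
object MetricHistorySketch {
  // Hypothetical stand-in for one stored metric value (e.g. a dataset size)
  // recorded at a given point in time.
  final case class MetricPoint(timestamp: Long, value: Double)

  // Flags the newest value if it exceeds the most recent historical value
  // by more than the factor maxRateIncrease. With no usable history
  // (empty, or previous value of zero) nothing is flagged.
  def isAnomalous(history: Seq[MetricPoint],
                  newest: MetricPoint,
                  maxRateIncrease: Double): Boolean =
    history.sortBy(_.timestamp).lastOption match {
      case Some(previous) if previous.value != 0.0 =>
        newest.value / previous.value > maxRateIncrease
      case _ => false
    }

  def main(args: Array[String]): Unit = {
    val history = Seq(MetricPoint(1L, 1000.0)) // yesterday: 1000 rows
    // 1500 rows today: growth factor 1.5, within the allowed factor of 2
    println(isAnomalous(history, MetricPoint(2L, 1500.0), maxRateIncrease = 2.0))
    // 2500 rows today: growth factor 2.5, flagged
    println(isAnomalous(history, MetricPoint(2L, 2500.0), maxRateIncrease = 2.0))
  }
}
```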

DQDL (Data Quality Definition Language)

Deequ also supports DQDL, a declarative language for defining data quality rules. DQDL allows you to express data quality constraints in a simple, readable format.

Supported DQDL Rules

RowCount: RowCount < 100
Completeness: Completeness "column" > 0.9
IsComplete: IsComplete "column"
Uniqueness: Uniqueness "column" = 1.0
IsUnique: IsUnique "column"
ColumnCorrelation: ColumnCorrelation "col1" "col2" > 0.8
DistinctValuesCount: DistinctValuesCount "column" = 5
Entropy: Entropy "column" > 2.0
Mean: Mean "column" between 10 and 50
StandardDeviation: StandardDeviation "column" < 5.0
Sum: Sum "column" = 100
UniqueValueRatio: UniqueValueRatio "column" > 0.7
CustomSql: CustomSql "SELECT COUNT(*) FROM primary" > 0
IsPrimaryKey: IsPrimaryKey "column"
ColumnLength: ColumnLength "column" between 1 and 5
ColumnExists: ColumnExists "column"

Scala Example

ScalaDQDLExample.scala

```scala
import com.amazon.deequ.dqdl.EvaluateDataQuality
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DQDL Example")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Sample data
val df = Seq(
  ("1", "a", "c"),
  ("2", "a", "c"),
  ("3", "a", "c"),
  ("4", "b", "d")
).toDF("item", "att1", "att2")

// Define rules using DQDL syntax
val ruleset = """Rules=[IsUnique "item", RowCount < 10, Completeness "item" > 0.8, Uniqueness "item" = 1.0]"""

// Evaluate data quality
val results = EvaluateDataQuality.process(df, ruleset)
results.show()
```
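To make the ruleset above more concrete, the following plain-Scala sketch computes by hand the quantities that its four rules compare against, using the same sample columns. It is not the DQDL evaluator. It assumes that uniqueness means the fraction of values occurring exactly once and that completeness means the fraction of non-null values; the helper names and the decision to ignore nulls when counting occurrences are choices made for this sketch.

```scala
object RulesetByHand {
  // The "item" and "att1" columns of the sample data above.
  val item: Seq[String] = Seq("1", "2", "3", "4")
  val att1: Seq[String] = Seq("a", "a", "a", "b")

  def rowCount(column: Seq[String]): Long = column.size.toLong

  // Fraction of non-null values.
  def completeness(column: Seq[String]): Double =
    column.count(_ != null).toDouble / column.size

  // Fraction of values that occur exactly once (nulls ignored here,
  // which is an assumption of this sketch).
  def uniqueness(column: Seq[String]): Double = {
    val occurrences = column.filter(_ != null).groupBy(identity).map {
      case (value, group) => value -> group.size
    }
    occurrences.values.count(_ == 1).toDouble / column.size
  }

  def main(args: Array[String]): Unit = {
    println(s"RowCount < 10:             ${rowCount(item) < 10}")       // 4 rows
    println(s"Completeness item > 0.8:   ${completeness(item) > 0.8}")  // 1.0
    println(s"Uniqueness item = 1.0:     ${uniqueness(item) == 1.0}")   // 1.0
    // For contrast: only "b" occurs once in att1, so its uniqueness is 0.25.
    println(s"Uniqueness att1:           ${uniqueness(att1)}")
  }
}
```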

Java Example

JavaDQDLExample.java

```java
import com.amazon.deequ.dqdl.EvaluateDataQuality;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("DQDL Java Example")
    .master("local[*]")
    .getOrCreate();

// Create sample data
Dataset<Row> df = spark.sql(
    "SELECT * FROM VALUES " +
    "('1', 'a', 'c'), " +
    "('2', 'a', 'c'), " +
    "('3', 'a', 'c'), " +
    "('4', 'b', 'd') " +
    "AS t(item, att1, att2)"
);

// Define rules using DQDL syntax
String ruleset = "Rules=[IsUnique \"item\", RowCount < 10, Completeness \"item\" > 0.8, Uniqueness \"item\" = 1.0]";

// Evaluate data quality
Dataset<Row> results = EvaluateDataQuality.process(df, ruleset);
results.show();
```

Citation

If you would like to reference this package in a research paper, please cite:

Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (August 2018), 1781-1794.

License

This library is licensed under the Apache 2.0 License.

Owner

Name: Amazon Web Services - Labs
Login: awslabs
Kind: organization
Location: Seattle, WA

Website: http://amazon.com/aws/
Repositories: 914
Profile: https://github.com/awslabs

AWS Labs

GitHub Events

Total

Create event: 29
Commit comment event: 1
Release event: 11
Issues event: 12
Watch event: 199
Issue comment event: 40
Push event: 42
Pull request review comment event: 20
Pull request review event: 49
Pull request event: 65
Fork event: 31

Last Year

Create event: 29
Commit comment event: 1
Release event: 11
Issues event: 12
Watch event: 199
Issue comment event: 40
Push event: 42
Pull request review comment event: 20
Pull request review event: 49
Pull request event: 65
Fork event: 31

Committers

Last synced: 9 months ago

All Time

Total Commits: 251
Total Committers: 75
Avg Commits per committer: 3.347
Development Distribution Score (DDS): 0.857

Past Year

Commits: 17
Committers: 13
Avg Commits per committer: 1.308
Development Distribution Score (DDS): 0.824

Top Committers

Name	Email	Commits
Sebastian	s**n@g**m	36
rdsharma26	6****6	18
Edward Cho	1****m	17
Tom Wollnik	w**k@a**m	15
sseb	s**b@a**m	11
Shuhei Kadowaki	a**k@g**m	11
Robert Ambrus	v**s@e**m	11
Stephan Seufert	4****s	9
Yannis Mentekidis	m****d	7
Philipp Schmidt	t****d	7
Stefan Grafberger	1****r	7
James Siri	j**i@a**m	7
Peng Chen	p**1@a**u	5
Philipp Schmidt	p**d@a**m	4
zeotuan	4****n	4
penikala	1****P	3
Paul Sukow	p**w@a**m	3
York-Winegar, James M	j**r@a**m	3
Zhuo (Joe) Wang	z**g@l**m	3
Andrius	1****l	3
bevhanno	g**b@h**r	3
Josh	5****y	2
Malcolm Greaves	m****s	2
Shashank Sharma	s**8@g**m	2
Vincent Chee Jia Hong	3****e	2
Yannis Mentekidis	m**d@a**m	2
lange-labs	6****s	2
ssc	s**c@a**g	2
Dustin Lange	4****e	2
Samarth	8****1	2
and 45 more...

Committer Domains (Top 20 + Academic)

amazon.com: 8
zetaris.com: 1
qq.com: 1
r.recruit.co.jp: 1
columbia.edu: 1
lavabit.com: 1
live.de: 1
apache.org: 1
hanno.fr: 1
linkedin.com: 1
accenture.com: 1
artemishealth.com: 1
andrew.cmu.edu: 1
expediagroup.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 113
Total pull requests: 177
Average time to close issues: 6 months
Average time to close pull requests: 25 days
Total issue authors: 99
Total pull request authors: 48
Average comments per issue: 1.96
Average comments per pull request: 0.99
Merged pull requests: 118
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 13
Pull requests: 72
Average time to close issues: about 1 month
Average time to close pull requests: 6 days
Issue authors: 11
Pull request authors: 17
Average comments per issue: 0.69
Average comments per pull request: 0.56
Merged pull requests: 51
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

eapframework (4)
marcantony (4)
asktushar (4)
zeotuan (3)
DivyangPatelIITD (2)
arsenalgunnershubert777 (2)
abhijit401 (1)
jbleduigou (1)
rdsharma26 (1)
aashishkITV (1)
jonathanapp (1)
rostoh (1)
garystafford (1)
psyking841 (1)
vaishnavibv13 (1)

Pull Request Authors

eycho-am (50)
rdsharma26 (33)
zeotuan (16)
happy-coral (12)
mentekid (12)
dariobig (6)
joshuazexter (6)
VenkataKarthikP (5)
kyraman (5)
scott-gunn (4)
shriyavanvari (4)
SamPom100 (4)
D-Minor (4)
arsenalgunnershubert777 (3)
samarth-c1 (3)

Top Labels

Issue Labels

enhancement (18)
bug (15)
question (15)
advanced-feature (1)
good first issue (1)

Pull Request Labels

dependencies (2)
enhancement (1)
help wanted (1)

Packages

Total packages: 1
Total downloads: unknown
Total docker downloads: 18

Total dependent packages: 9
Total dependent repositories: 28
Total versions: 61

repo1.maven.org: com.amazon.deequ:deequ

Homepage: https://github.com/awslabs/deequ
Documentation: https://appdoc.app/artifact/com.amazon.deequ/deequ/
License: Apache License, Version 2.0
Latest release: 1.0.5
published over 5 years ago

Versions: 61
Dependent Packages: 9
Dependent Repositories: 28
Docker Downloads: 18

Rankings

Dependent repos count: 4.4%

Stargazers count: 6.0%

Average: 6.4%

Dependent packages count: 6.7%

Forks count: 8.6%

Last synced: 6 months ago

Dependencies

.github/workflows/maven.yml actions

actions/checkout v3 composite
actions/setup-java v3 composite

pom.xml maven

org.apache.spark:spark-core_2.12 3.3.0
org.apache.spark:spark-mllib_2.12 3.3.0
org.apache.spark:spark-sql_2.12 3.3.0
org.scala-lang:scala-library 2.12.10
org.scala-lang:scala-reflect 2.12.10
org.scalanlp:breeze_2.12 0.13.2
org.apache.datasketches:datasketches-java 1.3.0-incubating test
org.mockito:mockito-core 2.28.2 test
org.openjdk.jmh:jmh-core 1.23 test
org.openjdk.jmh:jmh-generator-annprocess 1.23 test
org.scala-lang:scala-compiler 2.12.10 test
org.scalamock:scalamock_2.12 4.4.0 test
org.scalatest:scalatest_2.12 3.1.2 test
