CIOPages

Deequ (Amazon)

Open Source, Funded

Automated data quality checks for large-scale datasets on Apache Spark

About Deequ (Amazon)

Deequ is an open-source library developed by Amazon for defining and running "unit tests for data" on large datasets with Apache Spark. It measures data quality early in the data pipeline so that errors are caught before they propagate to downstream systems or machine learning models. Deequ is best suited to organizations with big data workloads that need scalable, automated data validation and monitoring.
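The "unit tests for data" idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Deequ's or PyDeequ's API: the helper names (`is_complete`, `is_unique`, `run_checks`) and the sample `orders` rows are hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# A check pairs a readable name with a predicate over the whole dataset.
# (Hypothetical structure for illustration; not how Deequ models checks.)
Check = Tuple[str, Callable[[List[Row]], bool]]


def is_complete(column: str) -> Check:
    """Passes when every row has a non-None value in `column`."""
    return (
        f"is_complete({column})",
        lambda rows: all(r.get(column) is not None for r in rows),
    )


def is_unique(column: str) -> Check:
    """Passes when no two rows share the same value in `column`."""
    return (
        f"is_unique({column})",
        lambda rows: len({r.get(column) for r in rows}) == len(rows),
    )


def run_checks(rows: List[Row], checks: List[Check]) -> Dict[str, bool]:
    """Evaluate each check against the dataset and report pass/fail by name."""
    return {name: predicate(rows) for name, predicate in checks}


# Hypothetical sample data: one order is missing its amount.
orders = [
    {"id": 1, "amount": 19.99},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 5.00},
]

results = run_checks(
    orders,
    [is_complete("id"), is_unique("id"), is_complete("amount")],
)
failed = [name for name, ok in results.items() if not ok]
print(failed)
```

In a pipeline, a non-empty `failed` list would typically stop the batch from being published downstream, which is the role the profile describes for Deequ at Spark scale.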

The library supports Spark 3.1 and Java 8 and provides APIs for defining data quality constraints and metrics that can be evaluated continuously. Deequ also offers a Python interface, PyDeequ, which broadens access for data engineers and data scientists. By embedding data quality tests in their data workflows, enterprises can improve trust in their data assets, reduce operational risk, and strengthen governance practices. Because Deequ is open source, it can be customized and extended by the community, which makes it a flexible tool for data governance and cataloging initiatives.
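Continuous evaluation of metrics can be pictured as computing the same measurements on each new batch and comparing them with the previous run. The sketch below is plain Python under stated assumptions: `completeness`, `compute_metrics`, `regressions`, the 0.05 tolerance, and the sample batches are all hypothetical, and none of it is taken from Deequ's or PyDeequ's API.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows whose `column` is not None (1.0 for an empty batch)."""
    if not rows:
        return 1.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def compute_metrics(rows: List[Row], columns: List[str]) -> Dict[str, float]:
    """Compute one completeness metric per column for a single batch."""
    return {f"completeness({c})": completeness(rows, c) for c in columns}


def regressions(
    previous: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,  # assumed threshold, chosen for illustration
) -> List[str]:
    """Names of metrics that dropped by more than `tolerance` since the last run."""
    return [
        name
        for name, value in current.items()
        if name in previous and previous[name] - value > tolerance
    ]


# Hypothetical batches from two consecutive pipeline runs.
batch_day_1 = [
    {"user": "a", "email": "a@example.com"},
    {"user": "b", "email": "b@example.com"},
    {"user": "c", "email": "c@example.com"},
    {"user": "d", "email": "d@example.com"},
]
batch_day_2 = [
    {"user": "e", "email": "e@example.com"},
    {"user": "f", "email": None},
    {"user": "g", "email": None},
    {"user": "h", "email": "h@example.com"},
]

metrics_day_1 = compute_metrics(batch_day_1, ["user", "email"])
metrics_day_2 = compute_metrics(batch_day_2, ["user", "email"])
dropped = regressions(metrics_day_1, metrics_day_2)
print(dropped)
```

Comparing against the previous run rather than a fixed threshold is one possible design; a fixed minimum per metric would work equally well for data whose quality is expected to be stable.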

Key Capabilities

  • Automated data quality validation on Apache Spark
  • Definition of data quality constraints and metrics
  • Scalable testing for large datasets
  • Python interface via PyDeequ
  • Integration with Spark-based data pipelines

Integrations

  • Apache Spark
  • PyDeequ (Python interface)
  • Maven Central for dependency management

This profile was compiled by CIOPages from public sources with AI assistance, and may be incomplete or out of date. It is informational only and not an endorsement.

Quick Facts

github.com/awslabs/deequ
Category: Data & Analytics
Subcategory: Data Governance & Catalog
Pricing: Open Source
Deployment: Open Source
Target Size: Enterprise