Deequ (Amazon)
Open Source | Funded
Automated data quality checks for large-scale datasets on Apache Spark
About Deequ (Amazon)
Deequ is an open-source library developed by Amazon for defining and running "unit tests for data" on large datasets using Apache Spark. It is designed to measure and enforce data quality early in the data pipeline, so that errors do not propagate to downstream systems or machine learning models. Deequ is particularly suited to organizations with big data workloads that require scalable, automated data validation and monitoring.
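To make the "unit tests for data" idea concrete, here is a minimal sketch in plain Python. It is not Deequ's API: the names `Constraint`, `is_complete`, `is_unique`, `is_non_negative`, and `verify` are hypothetical, and the rows are made-up sample data. Deequ itself expresses comparable checks against Spark DataFrames, so treat this only as an illustration of the concept on an in-memory list.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Constraint:
    """A named data quality rule evaluated over a whole batch of rows."""
    name: str
    check: Callable[[list], bool]


def is_complete(column):
    # Passes when no row has a missing (None) value in the column.
    return Constraint(
        f"is_complete({column})",
        lambda rows: all(r.get(column) is not None for r in rows),
    )


def is_unique(column):
    # Passes when every value in the column appears exactly once.
    def check(rows):
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))
    return Constraint(f"is_unique({column})", check)


def is_non_negative(column):
    # Passes when all non-missing values in the column are >= 0.
    return Constraint(
        f"is_non_negative({column})",
        lambda rows: all(r[column] >= 0 for r in rows if r.get(column) is not None),
    )


def verify(rows, constraints):
    # Evaluate each constraint and report pass/fail by constraint name.
    return {c.name: c.check(rows) for c in constraints}


# Hypothetical sample data: one record has a missing "name".
rows = [
    {"id": 1, "name": "a", "views": 10},
    {"id": 2, "name": None, "views": 0},
    {"id": 3, "name": "c", "views": 12},
]

results = verify(rows, [
    is_complete("id"),
    is_unique("id"),
    is_complete("name"),
    is_non_negative("views"),
])
failed = [name for name, ok in results.items() if not ok]
print(failed)
```

A pipeline step could stop or raise an alert whenever `failed` is non-empty, which is one way to keep bad records from reaching downstream consumers.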
The library works with Spark 3.1 and Java 8 and provides APIs for defining data quality constraints and metrics that can be evaluated continuously. Deequ also offers a Python interface, PyDeequ, which makes it more accessible to data engineers and data scientists. By embedding data quality tests into data workflows, enterprises can improve trust in their data assets, reduce operational risk, and strengthen governance practices. Because Deequ is open source, it can be customized and extended by the community, making it a flexible tool for data governance and cataloging initiatives.
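The idea of a metric that is evaluated continuously can be sketched as follows, again in plain Python rather than Deequ's or PyDeequ's API. The helpers `completeness` and `dropped_sharply`, the `max_drop` threshold, and the sample batches are all assumptions made for illustration: each incoming batch gets a completeness score, and a batch is flagged when its score falls well below the average of earlier batches.

```python
def completeness(rows, column):
    """Fraction of rows with a non-missing value in `column` (1.0 for an empty batch)."""
    if not rows:
        return 1.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def dropped_sharply(history, current, max_drop=0.1):
    """True if `current` is more than `max_drop` below the mean of past values."""
    if not history:
        return False  # nothing to compare against on the first run
    baseline = sum(history) / len(history)
    return baseline - current > max_drop


# Hypothetical batches arriving over time; the third one has a missing email.
batches = [
    [{"email": "a@x.org"}, {"email": "b@x.org"}],
    [{"email": "c@x.org"}, {"email": "d@x.org"}],
    [{"email": None}, {"email": "e@x.org"}],
]

history = []  # metric values from earlier runs
alerts = []   # indexes of batches whose metric dropped sharply

for index, batch in enumerate(batches):
    value = completeness(batch, "email")
    if dropped_sharply(history, value):
        alerts.append(index)
    history.append(value)

print(history, alerts)
```

Comparing against a running baseline, rather than a fixed threshold, is one design choice among several; a fixed minimum such as "completeness must be at least 0.95" is simpler when the expected quality level is known in advance.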
Key Capabilities
- ✓ Automated data quality validation on Apache Spark
- ✓ Definition of data quality constraints and metrics
- ✓ Scalable testing for large datasets
- ✓ Python interface via PyDeequ
- ✓ Integration with Spark-based data pipelines
This profile was compiled by CIOPages from public sources with AI assistance, and may be incomplete or out of date. It is informational only and not an endorsement.