Overview

The custom dbt test `check_row_counts_per_partition` has two arguments:

- `table_name`
- `partition_expr`

The table name can be computed from the default `model` argument for any instance of the test that runs against a parquet file (currently all instances), and the partition expression is blank in ~70 instances. If we add defaults to these arguments, we can collapse the `check_row_counts_per_partition` entry to a single line in ~70 schema files and reduce duplication in the remaining couple hundred; a sketch of the corresponding macro change follows the example below.
Example:

before:

```yaml
data_tests:
  - expect_columns_not_all_null
  - check_row_counts_per_partition:
      arguments:
        table_name: core_eia__codes_balancing_authorities
        partition_expr:
columns:
  - name: code
  - name: label
```

after:

```yaml
data_tests:
  - expect_columns_not_all_null
  - check_row_counts_per_partition
columns:
  - name: code
  - name: label
```
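
The defaulting could live in the generic test's signature. The sketch below is illustrative only: deriving the table name from `model.identifier`, treating a blank expression as a single table-wide partition, and the `row_counts_per_partition` seed with its column names are all assumptions, not the actual PUDL macro.

```sql
{% test check_row_counts_per_partition(model, table_name=none, partition_expr=none) %}

{# Default the table name to the tested relation's own identifier. #}
{% set table_name = table_name if table_name else model.identifier %}

{# A blank partition expression means the whole table is one partition. #}
{% set partition_expr = partition_expr if partition_expr else "1" %}

with observed as (
    select
        {{ partition_expr }} as partition_key,
        count(*) as observed_count
    from {{ model }}
    group by 1
),

expected as (
    select partition_key, expected_count
    from {{ ref("row_counts_per_partition") }}  -- hypothetical seed name
    where table_name = '{{ table_name }}'
)

-- Any returned row is a partition whose observed count disagrees with the
-- expected count; dbt treats returned rows as test failures.
select
    expected.partition_key,
    expected.expected_count,
    observed.observed_count
from expected
left join observed using (partition_key)
where observed.observed_count is distinct from expected.expected_count

{% endtest %}
```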
Success Criteria

How will we know that we're done?

- `check_row_counts_per_partition` no longer requires the `table_name` argument unless used to test something other than a parquet file
- `check_row_counts_per_partition` no longer requires the `partition_expr` argument if the expression would be blank
- No schema file lists the `table_name` argument unless testing something other than a parquet file
- No schema file lists the `partition_expr` argument unless the expression is non-empty
Next steps

- draft changes to `check_row_counts_per_partition`
- write a script to apply the changes to all the schema files (see the sketch after this list)
- one of:
  - acquire the will to `git add -p` only the changes to the rowcounts spec and not the other changes resulting from re-serializing each yaml file (4-600 little yes/no/split decisions)
  - write a different script that doesn't require re-serializing all the yaml files
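
One possible shape for the rewrite script, sketched under stated assumptions: the schema files live under `dbt/models/`, the tests hang off `models:` entries (adjust if they live under `sources:`), and ruamel.yaml's round-trip mode is used even though it re-serializes each file, which is exactly the churn that motivates the `git add -p`-vs-second-script decision above.

```python
"""Sketch: strip now-redundant arguments from check_row_counts_per_partition entries."""
from pathlib import Path

from ruamel.yaml import YAML

yaml = YAML()  # round-trip mode preserves comments and most formatting
yaml.preserve_quotes = True

SCHEMA_DIR = Path("dbt/models")  # assumed location of the schema files


def simplify(test):
    """Drop arguments that the updated test would now default."""
    if not isinstance(test, dict) or "check_row_counts_per_partition" not in test:
        return test
    body = test["check_row_counts_per_partition"] or {}
    args = dict(body.get("arguments") or {})
    args.pop("table_name", None)  # derived from the tested model by default
    if not args.get("partition_expr"):
        # Blank partition expression is the new default: collapse to one line.
        return "check_row_counts_per_partition"
    return {"check_row_counts_per_partition": {"arguments": args}}


for path in sorted(SCHEMA_DIR.rglob("*.yml")):
    doc = yaml.load(path.read_text())
    for model in (doc or {}).get("models", []):
        tests = model.get("data_tests")
        if tests:
            model["data_tests"] = [simplify(t) for t in tests]
    # Re-serializing every file is what produces the unrelated diff noise.
    with path.open("w") as f:
        yaml.dump(doc, f)
```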