Add defaults to check_row_counts_per_partition to simplify schema files #4564

@krivard

Overview

The custom dbt test check_row_counts_per_partition has two arguments:

  • table_name
  • partition_expr

The table name can be computed from the default model argument for any instance of the test run against a parquet file (currently all instances).

The partition expression is blank in ~70 instances.

If we add defaults to these arguments, we can convert the check_row_counts_per_partition entry to a single line in ~70 schema files, and reduce duplication in the remaining couple hundred.

Example:

before:

        data_tests:
          - expect_columns_not_all_null
          - check_row_counts_per_partition:
              arguments:
                table_name: core_eia__codes_balancing_authorities
                partition_expr:
        columns:
          - name: code
          - name: label

after:

        data_tests:
          - expect_columns_not_all_null
          - check_row_counts_per_partition
        columns:
          - name: code
          - name: label

Success Criteria

How will we know that we're done?

  • check_row_counts_per_partition no longer requires the table_name argument unless used to test something other than a parquet file
  • check_row_counts_per_partition no longer requires the partition_expr if the expression would be blank
  • No schema file lists the table_name argument unless testing something other than a parquet file
  • No schema file lists the partition_expr argument unless the expression is non-empty
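The last criterion can be spot-checked mechanically. The following is a minimal sketch, not part of the existing tooling: the function name `find_blank_partition_exprs` is hypothetical, and it assumes a blank expression is written as a bare `partition_expr:` key with nothing after the colon (quoted empty strings or `null` would need extra handling).

```python
import re

# A bare "partition_expr:" key with no value after the colon.
# Assumption: blank expressions are always written this way in schema files.
BLANK_PARTITION_EXPR = re.compile(r"^\s*partition_expr:\s*$")


def find_blank_partition_exprs(text: str) -> list[int]:
    """Return 1-based line numbers where partition_expr is listed with no value."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if BLANK_PARTITION_EXPR.match(line)
    ]
```

Running this over the text of each schema file and requiring an empty result for every file would cover the `partition_expr` half of the criteria; the `table_name` half also needs the model name for comparison.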

Next steps

  • draft changes to check_row_counts_per_partition
  • write a script to apply the changes to all the schema files
  • one of:
    • acquire the will to git add -p only the changes to the rowcounts spec and not the other changes resulting from re-serializing each yaml file (400-600 little yes/no/split decisions)
    • write a different script that doesn't require re-serializing all the yaml files
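One way the second option might look is a text-level rewrite that never parses or re-serializes the YAML, so only the matched lines change. This is a sketch under assumptions, not a tested migration script: the function name `collapse_rowcount_entries` is hypothetical, the caller is assumed to know the model name for the file, and the pattern only matches entries shaped like the "before" example above (an `arguments:` block holding `table_name` followed by a blank `partition_expr`).

```python
import re

# Matches an entry whose only arguments are table_name and a blank
# partition_expr, in that order, as in the "before" example.
BLOCK = re.compile(
    r"^(?P<indent>[ ]*)- check_row_counts_per_partition:[ ]*\n"
    r"(?P=indent)[ ]+arguments:[ ]*\n"
    r"(?P=indent)[ ]+table_name:[ ]*(?P<table>\S+)[ ]*\n"
    r"(?P=indent)[ ]+partition_expr:[ ]*\n",
    re.MULTILINE,
)


def collapse_rowcount_entries(text: str, model_name: str) -> str:
    """Collapse matching entries to a single line, leaving all other text untouched."""

    def _replace(match: re.Match) -> str:
        # Only drop table_name when it matches the model, since that is the
        # case the proposed default would cover.
        if match.group("table") != model_name:
            return match.group(0)
        return f"{match.group('indent')}- check_row_counts_per_partition\n"

    return BLOCK.sub(_replace, text)
```

Entries with a non-empty `partition_expr`, or with arguments in a different order, are left as-is and would still need separate handling to drop only the redundant `table_name` line.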

Metadata

  • Assignees: No one assigned
  • Labels:
    • data-validation: Issues related to checking whether data meets our quality expectations.
    • dbt: Issues related to the data build tool aka dbt
    • developer experience: Things that make the developers' lives easier, but don't necessarily directly improve the data.
  • Type: No type
  • Project status: Icebox
  • Milestone: No milestone
  • Relationships: None yet
  • Development: No branches or pull requests