Skip to content

Spark: RewriteTablePath: Refactoring of RewriteTablePathSparkAction and RewriteTablePathUtil #13932

@vaultah

Description

@vaultah

Feature Request / Improvement

This is a follow-up to the work done in PR #13720 to fix a correctness bug in the RewriteTablePath action (#13719).

Problem:
The current implementation uses .toLocalIterator() to collect the results of the manifest rewrite job on the driver. While this is memory-safe, it puts the entire aggregation workload (processing O(N) records and building the final map) sequentially on the driver. This can become a performance bottleneck for tables with a very large number of manifests.

Additionally, as discussed in the original PR, the way the generic RewriteResult and RewriteContentFileResult classes are used is awkward for this bottom-up aggregation pattern.

Proposed Solution:
The action should be refactored to use a dedicated, reducible result class to perform the aggregation in parallel on the Spark executors.

This refactoring will improve the scalability of the action and simplify the internal data flow, making the code cleaner and more maintainable.

Query engine

Spark

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time

Metadata

Metadata

Assignees

No one assigned

    Labels

    improvementPR that improves existing functionality

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions