-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Description
Feature Request / Improvement
This is a follow-up to the work done in PR #13720 to fix a correctness bug in the RewriteTablePath
action (#13719).
Problem:
The current implementation uses .toLocalIterator()
to collect the results of the manifest rewrite job on the driver. While this is memory-safe, it puts the entire aggregation workload (processing O(N) records and building the final map) sequentially on the driver. This can become a performance bottleneck for tables with a very large number of manifests.
Additionally, as discussed in the original PR, the way the generic RewriteResult
and RewriteContentFileResult
classes are used is awkward for this bottom-up aggregation pattern.
Proposed Solution:
The action should be refactored to use a dedicated, reducible result class to perform the aggregation in parallel on the Spark executors.
This refactoring will improve the scalability of the action and simplify the internal data flow, making the code cleaner and more maintainable.
Query engine
Spark
Willingness to contribute
- I can contribute this improvement/feature independently
- I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- I cannot contribute this improvement/feature at this time