Feature Request / Improvement
Problem Statement
When a MERGE operation affects a large number of partitions, Iceberg currently plans and commits it as a single atomic logical operation.
All affected partitions are therefore included in one execution plan, and Spark must shuffle data across all of them simultaneously.
The planner cannot isolate partition-specific operations, even though partitions are physically independent storage units.
This creates excessive shuffle overhead in Spark that scales poorly with the number of affected partitions, even when data volume is modest.
We've observed that this shuffle dominates processing time, and reducing batch size doesn't help if the partition count remains high.
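For illustration, a statement of the kind that triggers this behavior might look like the sketch below (the catalog, table, and column names are hypothetical; the target is assumed to be partitioned by day):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example: an events table partitioned by event_day receives a
# batch of late-arriving updates that touch many historical partitions in a
# single MERGE. Spark plans one MERGE covering every affected partition.
spark.sql("""
    MERGE INTO warehouse.db.events AS t
    USING updates_batch AS s
    ON t.event_id = s.event_id AND t.event_day = s.event_day
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

Even though each updated row lands in exactly one partition, the resulting plan spans every affected partition, so the shuffle scope grows with partition count rather than with data volume.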
Proposed Enhancement
Enhance Iceberg to natively support partition-aware parallelism for MERGE operations (one possible user-facing shape is sketched after this list):
- Add capability to automatically split MERGE operations by partition boundaries.
- Leverage partition independence to reduce shuffle scope per operation.
- Process partition groups concurrently while maintaining proper commit semantics.
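Purely for illustration, such support might be exposed through table properties that opt a table into partition-scoped MERGE planning. The property names below are hypothetical and do not exist in Iceberg today:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative only: hypothetical properties sketching how partition-aware
# MERGE parallelism might be exposed; neither property exists in Iceberg today.
spark.sql("""
    ALTER TABLE warehouse.db.events SET TBLPROPERTIES (
        'write.merge.partition-aware.enabled' = 'true',
        'write.merge.partition-aware.max-concurrency' = '8'
    )
""")
```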
Impact
This enhancement would:
- Improve performance for workloads with high partition counts.
- Reduce resource consumption for merge operations.
- Handle late-arriving data that touches historical partitions more effectively.
- Eliminate the need for application-level workarounds.
Current Workaround
The current workaround is an application-level solution (a minimal sketch follows this list) that:
- Splits the source batch into partition-specific chunks, each covering a small subset of partitions.
- Runs a separate MERGE operation per chunk, scoped to that chunk's partitions.
- Executes multiple chunks concurrently up to a configured limit.
- Reduces shuffle cost per operation by limiting partition scope.
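A minimal sketch of this workaround, assuming a PySpark driver, a target table partitioned by a date column `event_day`, and a source batch registered as a temp view `updates_batch` (all names and the concurrency limit are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder: how many partition-scoped MERGEs to run at once.
MAX_CONCURRENT_MERGES = 4

def merge_partition_chunk(day):
    # Each chunk scopes both the source and the join to a single partition
    # value, which keeps the shuffle limited to that partition's data.
    spark.sql(f"""
        MERGE INTO warehouse.db.events AS t
        USING (SELECT * FROM updates_batch WHERE event_day = DATE'{day}') AS s
        ON t.event_id = s.event_id AND t.event_day = s.event_day
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

# Find the distinct partitions touched by this batch, then run one MERGE per
# partition with bounded concurrency.
days = [row.event_day for row in
        spark.sql("SELECT DISTINCT event_day FROM updates_batch").collect()]

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_MERGES) as pool:
    # list() forces evaluation so any failed chunk raises in the driver.
    list(pool.map(merge_partition_chunk, days))
```

Each concurrent MERGE still produces its own snapshot and goes through Iceberg's optimistic commit path, which is where the downsides below come from.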
Downsides of Workaround
- Increased Complexity: Requires custom application logic to handle chunking, concurrency control, and error handling.
- Metadata Bloat: Multiple small commits generate more snapshots, requiring more aggressive compaction and cleanup.
- Consistency Challenges: Application must ensure idempotent processing across job restarts when chunks are partially processed.
- Parameter Tuning: Requires careful configuration of Iceberg's commit retry parameters to handle the increased commit concurrency (the relevant table properties are shown after this list).
- Limited Optimization: The application has far less visibility into the underlying storage layout than Iceberg itself has.
- Resource Contention: Concurrent operations can lead to resource contention at the metadata service level.
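For the parameter-tuning point above, the knobs involved are Iceberg's standard commit retry table properties; a typical adjustment looks like the following (the values are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raise the commit retry limits so concurrent per-partition MERGE commits can
# tolerate snapshot contention. The property names are standard Iceberg table
# properties; the values shown are illustrative only.
spark.sql("""
    ALTER TABLE warehouse.db.events SET TBLPROPERTIES (
        'commit.retry.num-retries' = '10',
        'commit.retry.min-wait-ms' = '250'
    )
""")
```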
Iceberg version: 1.9.1
Spark version: 3.5.5
Query engine
Spark
Willingness to contribute
- I can contribute this improvement/feature independently
- I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- I cannot contribute this improvement/feature at this time