refactor(writer): Refactor writers for the future partitioning writers #1657

CTTY · 2025-09-07T01:58:36Z

Which issue does this PR close?

Closes Decouple ParquetWriter and LocationGenerator #1650

What changes are included in this PR?

Refactored the writer layers; from a bird’s-eye view, the structure now looks like this:

flowchart TD
    subgraph PartitioningWriter
        PW[PartitioningWriter]

        subgraph DataFileWriter
            RW[DataFileWriter]

            subgraph RollingWriter
                DFW[RollingWriter]

                subgraph FileWriter
                    FW[FileWriter]
                end

                DFW --> FW
            end

            RW --> DFW
        end

        PW --> RW
    end

Key Changes

Enhanced Partition Handling
- Added PartitionKey accessor methods and utility functions
- Introduced PartitioningWriter and PartitioningWriterBuilder traits

Refactored File Writer Architecture
- Modified RollingFileWriter to handle location generator, file name generator, and partition keys directly
- Simplified ParquetWriterBuilder interface to accept output files during build
- Restructured DataFileWriterBuilder to use RollingFileWriter with partition keys
- Updated DataFusion integration to work with the new writer architecture

NOTE: Technically DataFusion or any engine should use TaskWriter -> PartitioningWriter -> RollingWriter -> ..., but TaskWriter and PartitioningWriter are not included in this draft so far
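To make the "RollingFileWriter handles location generator, file name generator, and partition keys directly" point concrete, here is a minimal sketch. It is synchronous and uses reduced, hypothetical stand-ins (`DirLocationGenerator`, `CountingFileNameGenerator`, a partition path as a plain string, byte counts instead of real files), so it should be read as an illustration of the shape and not as the actual signatures in this PR.

```rust
// Simplified, synchronous stand-ins for illustration only; these are not the
// real iceberg-rust traits or signatures.

/// Hypothetical location generator: joins base dir, partition path and file name.
pub trait LocationGenerator {
    fn generate_location(&self, partition_path: Option<&str>, file_name: &str) -> String;
}

/// Hypothetical file name generator: hands out a fresh name per file.
pub trait FileNameGenerator {
    fn generate_file_name(&mut self) -> String;
}

pub struct DirLocationGenerator {
    pub base: String,
}

impl LocationGenerator for DirLocationGenerator {
    fn generate_location(&self, partition_path: Option<&str>, file_name: &str) -> String {
        match partition_path {
            Some(p) => format!("{}/{}/{}", self.base, p, file_name),
            None => format!("{}/{}", self.base, file_name),
        }
    }
}

pub struct CountingFileNameGenerator {
    pub next_id: u32,
}

impl FileNameGenerator for CountingFileNameGenerator {
    fn generate_file_name(&mut self) -> String {
        let name = format!("part-{:05}.parquet", self.next_id);
        self.next_id += 1;
        name
    }
}

/// Owns both generators and the partition path, and starts a new file once
/// the current one has reached `target_size` bytes.
pub struct RollingWriter<L: LocationGenerator, F: FileNameGenerator> {
    location_gen: L,
    file_name_gen: F,
    partition_path: Option<String>,
    target_size: usize,
    current: Option<(String, usize)>, // (location, bytes written so far)
    finished: Vec<(String, usize)>,
}

impl<L: LocationGenerator, F: FileNameGenerator> RollingWriter<L, F> {
    pub fn new(location_gen: L, file_name_gen: F, partition_path: Option<String>, target_size: usize) -> Self {
        Self { location_gen, file_name_gen, partition_path, target_size, current: None, finished: Vec::new() }
    }

    pub fn write(&mut self, data: &[u8]) {
        // Roll over first if the current file already reached the target size.
        let full = self.current.as_ref().map_or(false, |f| f.1 >= self.target_size);
        if full {
            self.roll();
        }
        if self.current.is_none() {
            let name = self.file_name_gen.generate_file_name();
            let location = self.location_gen.generate_location(self.partition_path.as_deref(), &name);
            self.current = Some((location, 0));
        }
        if let Some(file) = self.current.as_mut() {
            file.1 += data.len();
        }
    }

    fn roll(&mut self) {
        if let Some(done) = self.current.take() {
            self.finished.push(done);
        }
    }

    /// Closes the writer and returns (location, size) for every file written.
    pub fn close(mut self) -> Vec<(String, usize)> {
        self.roll();
        self.finished
    }
}
```

In this sketch the file format writer no longer needs to know where its output goes; the rolling layer decides the location each time it opens a new file.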

Are these changes tested?

Not yet, but updating the existing tests accordingly should be enough.

ZENOTME · 2025-09-07T13:40:08Z

crates/iceberg/src/writer/file_writer/rolling_writer.rs

-    /// Creates a new `RollingFileWriterBuilder` with the specified inner builder and target size.
+impl<B, L, F> RollingWriter<B, L, F>
+where
+    B: IcebergWriterBuilder,


One thing to note: this is what IcebergWriterBuilder looks like.

#[async_trait::async_trait]
pub trait IcebergWriterBuilder<I = DefaultInput, O = DefaultOutput>:
    Send + Clone + 'static
{
    /// The associated writer type.
    type R: IcebergWriter<I, O>;
    /// Build the iceberg writer.
    async fn build(self) -> Result<Self::R>;
}

A writer such as the position delete writer takes a different input type, as shown here (see #704):

#[async_trait::async_trait]
impl<B: FileWriterBuilder> IcebergWriterBuilder<Vec<PositionDeleteInput>>
    for PositionDeleteWriterBuilder<B>
{
    type R = PositionDeleteWriter<B>;

    async fn build(self) -> Result<Self::R> {
        Ok(PositionDeleteWriter {
            inner_writer: Some(self.inner.build().await?),
            partition_value: self.partition_value.unwrap_or(Struct::empty()),
        })
    }
}

That's why the rolling writer was a FileWriter in the first place. After we adopt this design, how can we express something like

RollingWriter<PositionDeleteWriter>

I think your concern is valid, we may need to expose I and O in the RollingWriter as well, and that should solve this problem?

pub struct RollingWriter<B, L, F, I, O>
where
    B: IcebergWriterBuilder<I, O>,
    L: LocationGenerator,
    F: FileNameGenerator,

Meanwhile I've been wondering how useful is the abstraction of IcebergWriter... If we separate RollingWriter into RollingPositionalDeletesWriter and RollingXXXWriter and have them use concrete types then this would be a lot easier

Meanwhile I've been wondering how useful is the abstraction of IcebergWriter

E.g. a user may want to write a custom writer that tracks some metrics, like the following:

RollingWriter<TrackPositionalDeletesWriter>

I think custom writers can either implement FileWriter (lightweight, file-level customization) or PartitioningWriter (heavier, customization across multiple partitions).

In your example, the custom writer can implement FileWriter and be used like this:

RollingPositionalDeletesWriter<TrackWriter>
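A rough sketch of what "the custom writer implements FileWriter" could look like. The `FileWriter` trait below is a reduced, synchronous stand-in (rows are plain `i64` values), and `InMemoryWriter`, `TrackWriter::metrics` and `write_batches` are hypothetical names used only for this illustration.

```rust
// Simplified, synchronous stand-in; not the real iceberg-rust `FileWriter` trait.
pub trait FileWriter {
    fn write(&mut self, rows: &[i64]);
    /// Consumes the writer and returns the number of rows written.
    fn close(self) -> usize;
}

/// Hypothetical base writer that just buffers rows in memory.
#[derive(Default)]
pub struct InMemoryWriter {
    rows: Vec<i64>,
}

impl FileWriter for InMemoryWriter {
    fn write(&mut self, rows: &[i64]) {
        self.rows.extend_from_slice(rows);
    }
    fn close(self) -> usize {
        self.rows.len()
    }
}

/// Hypothetical custom writer: wraps any `FileWriter` and records metrics.
pub struct TrackWriter<W: FileWriter> {
    inner: W,
    write_calls: usize,
    rows_seen: usize,
}

impl<W: FileWriter> TrackWriter<W> {
    pub fn new(inner: W) -> Self {
        Self { inner, write_calls: 0, rows_seen: 0 }
    }
    /// Returns (number of write calls, number of rows seen).
    pub fn metrics(&self) -> (usize, usize) {
        (self.write_calls, self.rows_seen)
    }
}

impl<W: FileWriter> FileWriter for TrackWriter<W> {
    fn write(&mut self, rows: &[i64]) {
        self.write_calls += 1;
        self.rows_seen += rows.len();
        self.inner.write(rows);
    }
    fn close(self) -> usize {
        self.inner.close()
    }
}

/// Any outer layer that is generic over `FileWriter` accepts the wrapper unchanged.
pub fn write_batches<W: FileWriter>(writer: &mut W, batches: &[Vec<i64>]) {
    for batch in batches {
        writer.write(batch);
    }
}
```

The limitation raised in the thread still applies: a wrapper at this level only sees what the file-level interface exposes, not a typed input such as PositionDeleteInput.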

I think custom writers can either implement FileWriter (lightweight, file-level customization) or PartitioningWriter (heavier, customization across multiple partitions).

E.g. the user wants to access PositionDeleteInput directly.

pub struct RollingWriter<B, L, F, I, O>
where
    B: IcebergWriterBuilder<I, O>,
    L: LocationGenerator,
    F: FileNameGenerator,

I think this approach is easier to extend in the future and gives users more flexibility to customize. But either way looks good to me if this introduces too much unnecessary complexity.
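For reference, the generic `I`/`O` variant discussed in this thread could be sketched roughly as follows. Everything here is a simplified assumption: the traits are synchronous, the `L` and `F` parameters are left out, the output type is a plain row count, and the writer rolls after a fixed number of writes instead of a byte size.

```rust
use std::marker::PhantomData;

// Simplified, synchronous stand-ins; not the real iceberg-rust traits.
pub trait IcebergWriter<I, O> {
    fn write(&mut self, input: I);
    fn close(self) -> O;
}

pub trait IcebergWriterBuilder<I, O>: Clone {
    type R: IcebergWriter<I, O>;
    fn build(&self) -> Self::R;
}

/// Hypothetical, reduced version of a position delete input row.
#[derive(Debug, Clone, PartialEq)]
pub struct PositionDeleteInput {
    pub path: String,
    pub pos: u64,
}

pub struct PositionDeleteWriter {
    buffered: Vec<PositionDeleteInput>,
}

// Output is just a row count here to keep the sketch small.
impl IcebergWriter<Vec<PositionDeleteInput>, usize> for PositionDeleteWriter {
    fn write(&mut self, input: Vec<PositionDeleteInput>) {
        self.buffered.extend(input);
    }
    fn close(self) -> usize {
        self.buffered.len()
    }
}

#[derive(Clone)]
pub struct PositionDeleteWriterBuilder;

impl IcebergWriterBuilder<Vec<PositionDeleteInput>, usize> for PositionDeleteWriterBuilder {
    type R = PositionDeleteWriter;
    fn build(&self) -> Self::R {
        PositionDeleteWriter { buffered: Vec::new() }
    }
}

/// Rolling writer generic over the inner writer's input `I` and output `O`.
pub struct RollingWriter<B, I, O>
where
    B: IcebergWriterBuilder<I, O>,
{
    builder: B,
    max_writes_per_file: usize,
    writes_in_current: usize,
    current: Option<<B as IcebergWriterBuilder<I, O>>::R>,
    outputs: Vec<O>,
    _input: PhantomData<I>, // marker carrying the input type `I`
}

impl<B, I, O> RollingWriter<B, I, O>
where
    B: IcebergWriterBuilder<I, O>,
{
    pub fn new(builder: B, max_writes_per_file: usize) -> Self {
        Self {
            builder,
            max_writes_per_file,
            writes_in_current: 0,
            current: None,
            outputs: Vec::new(),
            _input: PhantomData,
        }
    }

    pub fn write(&mut self, input: I) {
        if self.writes_in_current >= self.max_writes_per_file {
            self.roll();
        }
        if self.current.is_none() {
            self.current = Some(self.builder.build());
        }
        if let Some(writer) = self.current.as_mut() {
            writer.write(input);
            self.writes_in_current += 1;
        }
    }

    fn roll(&mut self) {
        if let Some(writer) = self.current.take() {
            self.outputs.push(writer.close());
        }
        self.writes_in_current = 0;
    }

    /// Closes the last inner writer and returns one output per rolled file.
    pub fn close(mut self) -> Vec<O> {
        self.roll();
        self.outputs
    }
}
```

The cost of this flexibility is visible in the signature: every layer that wraps the rolling writer has to carry `I` and `O` as well.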

CTTY · 2025-09-09T18:37:05Z

crates/iceberg/src/writer/mod.rs

+
 /// The builder for iceberg writer.
 #[async_trait::async_trait]
 pub trait IcebergWriterBuilder<I = DefaultInput, O = DefaultOutput>:


I believe we will also need to change the DefaultOutput for IcebergWriter from Vec<DataFile> to Vec<DataFileBuilder> since IcebergWriter is no longer the outermost writer
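One way to read this, as a sketch: if the inner writer hands back builders, an outer layer can still fill in what the inner layer does not know before the final build. The `DataFile` and `DataFileBuilder` below are heavily reduced, hypothetical stand-ins (three fields, partition as a string), and `close_inner_writer` and `finalize` are made-up helper names.

```rust
// Hypothetical, heavily reduced stand-ins for `DataFile` and `DataFileBuilder`;
// the real types carry many more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub file_path: String,
    pub record_count: u64,
    pub partition: String,
}

#[derive(Debug, Default)]
pub struct DataFileBuilder {
    file_path: Option<String>,
    record_count: Option<u64>,
    partition: Option<String>,
}

impl DataFileBuilder {
    pub fn file_path(mut self, path: &str) -> Self {
        self.file_path = Some(path.to_string());
        self
    }
    pub fn record_count(mut self, count: u64) -> Self {
        self.record_count = Some(count);
        self
    }
    pub fn partition(mut self, partition: &str) -> Self {
        self.partition = Some(partition.to_string());
        self
    }
    /// Fails if any field is still unset.
    pub fn build(self) -> Result<DataFile, String> {
        Ok(DataFile {
            file_path: self.file_path.ok_or("file_path is not set")?,
            record_count: self.record_count.ok_or("record_count is not set")?,
            partition: self.partition.ok_or("partition is not set")?,
        })
    }
}

/// Inner layer: knows paths and row counts, but not the partition.
pub fn close_inner_writer() -> Vec<DataFileBuilder> {
    vec![
        DataFileBuilder::default().file_path("data/part-0.parquet").record_count(100),
        DataFileBuilder::default().file_path("data/part-1.parquet").record_count(42),
    ]
}

/// Outer layer: fills in what the inner layer could not know, then builds.
pub fn finalize(builders: Vec<DataFileBuilder>, partition: &str) -> Result<Vec<DataFile>, String> {
    builders.into_iter().map(|b| b.partition(partition).build()).collect()
}
```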

liurenjie1024 · 2025-09-16T09:58:40Z

Hi @CTTY, it seems this has not been updated following our discussion?

CTTY · 2025-09-17T00:08:48Z

Hi @liurenjie1024, do you mean that we should also include TaskWriter and have TaskWriter split batches by partition? This draft mainly focuses on refactoring the existing layers and making RollingWriter the top-level writer for now, and I haven't incorporated this with an actual partitioning writer or task writer yet. Or do you think it's better to have everything in one draft?

liurenjie1024 · 2025-09-16T09:57:10Z

crates/iceberg/src/writer/mod.rs

 type DefaultOutput = Vec<DataFile>;

+/// The partitioning writer used to write data to multiple partitions.
+pub trait PartitioningWriter {


Add #[async_trait] annotation?

liurenjie1024 · 2025-09-16T09:57:34Z

crates/iceberg/src/writer/mod.rs

-    /// Build the iceberg writer.
-    async fn build(self) -> Result<Self::R>;
+    /// Build the iceberg writer with the provided output file.
+    async fn build(self, output_file: OutputFile) -> Result<Self::R>;


We don't need this change per our discussion?

liurenjie1024 · 2025-09-18T09:22:44Z

Hi @liurenjie1024, do you mean that we should also include TaskWriter and have TaskWriter split batches by partition? This draft mainly focuses on refactoring the existing layers and making RollingWriter the top-level writer for now, and I haven't incorporated this with an actual partitioning writer or task writer yet. Or do you think it's better to have everything in one draft?

Hi @CTTY, I'm not saying we should include TaskWriter. Per our discussion, we should have the following dependency:

PartitionedWriter
        |
DataFileWriter (EqDeleteWriter, PositionDeleteWriter)  -> This layer is IcebergWriter
        |
RollingFileWriter
        |
FileWriter (Parquet, ORC)  -> This layer is the file format writer
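The dependency chain above can be sketched as plain composition. This is a synchronous toy model in which every layer only counts rows; the struct names follow the diagram, but the fields and methods are assumptions made for the illustration, and the partition key is a plain string.

```rust
use std::collections::BTreeMap;

// Simplified, synchronous sketch of the four layers; all types are
// hypothetical stand-ins that only count rows.

/// File format layer (Parquet, ORC, ...): one physical file.
pub struct FileWriter {
    path: String,
    rows: usize,
}

/// Rolling layer: starts a new `FileWriter` once the current one is full.
pub struct RollingFileWriter {
    dir: String,
    max_rows: usize,
    next_id: usize,
    current: Option<FileWriter>,
    finished: Vec<(String, usize)>,
}

impl RollingFileWriter {
    pub fn new(dir: String, max_rows: usize) -> Self {
        Self { dir, max_rows, next_id: 0, current: None, finished: Vec::new() }
    }

    pub fn write(&mut self, rows: usize) {
        let full = self.current.as_ref().map_or(false, |f| f.rows >= self.max_rows);
        if full {
            self.roll();
        }
        if self.current.is_none() {
            let path = format!("{}/{}.parquet", self.dir, self.next_id);
            self.next_id += 1;
            self.current = Some(FileWriter { path, rows: 0 });
        }
        if let Some(file) = self.current.as_mut() {
            file.rows += rows;
        }
    }

    fn roll(&mut self) {
        if let Some(file) = self.current.take() {
            self.finished.push((file.path, file.rows));
        }
    }

    pub fn close(mut self) -> Vec<(String, usize)> {
        self.roll();
        self.finished
    }
}

/// IcebergWriter layer: in this sketch it only delegates to the rolling layer.
pub struct DataFileWriter {
    inner: RollingFileWriter,
}

impl DataFileWriter {
    pub fn write(&mut self, rows: usize) {
        self.inner.write(rows);
    }
    pub fn close(self) -> Vec<(String, usize)> {
        self.inner.close()
    }
}

/// Top layer: keeps one `DataFileWriter` per partition key.
pub struct PartitionedWriter {
    base: String,
    max_rows: usize,
    writers: BTreeMap<String, DataFileWriter>,
}

impl PartitionedWriter {
    pub fn new(base: &str, max_rows: usize) -> Self {
        Self { base: base.to_string(), max_rows, writers: BTreeMap::new() }
    }

    pub fn write(&mut self, partition: &str, rows: usize) {
        let dir = format!("{}/{}", self.base, partition);
        let max_rows = self.max_rows;
        self.writers
            .entry(partition.to_string())
            .or_insert_with(|| DataFileWriter { inner: RollingFileWriter::new(dir, max_rows) })
            .write(rows);
    }

    /// Closes every partition's writer; results are ordered by partition key.
    pub fn close(self) -> Vec<(String, usize)> {
        self.writers.into_iter().flat_map(|(_, w)| w.close()).collect()
    }
}
```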

CTTY · 2025-09-21T03:35:20Z

crates/iceberg/src/writer/partitioning/clustered.rs

+// ///
+// /// Once a partition has been written to and closed, any further attempts
+// /// to write to that partition will result in an error.
+// pub struct ClusteredWriter<B: IcebergWriterBuilder, I: Default + Send = DefaultInput, O: Default + Send = DefaultOutput>


Please ignore this for now; I think it's better to keep this draft/round of changes focused on the interface changes to the existing writers.
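For readers unfamiliar with the behaviour described in that doc comment ("once a partition has been written to and closed, any further attempts to write to that partition will result in an error"), here is a minimal sketch. It is not the commented-out code from this PR: the struct below is a non-generic stand-in with string partition keys, and `WriteError` is a hypothetical error type.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum WriteError {
    /// The partition was already written and closed earlier.
    PartitionAlreadyClosed(String),
}

/// Hypothetical clustered writer: expects rows grouped by partition and keeps
/// only one partition open at a time.
pub struct ClusteredWriter {
    current: Option<(String, Vec<i64>)>,
    closed: HashSet<String>,
    finished: Vec<(String, usize)>,
}

impl ClusteredWriter {
    pub fn new() -> Self {
        Self { current: None, closed: HashSet::new(), finished: Vec::new() }
    }

    pub fn write(&mut self, partition: &str, row: i64) -> Result<(), WriteError> {
        if self.closed.contains(partition) {
            return Err(WriteError::PartitionAlreadyClosed(partition.to_string()));
        }
        // Switching to a new partition closes the previous one for good.
        let same = self.current.as_ref().map_or(false, |c| c.0 == partition);
        if !same {
            self.close_current();
            self.current = Some((partition.to_string(), Vec::new()));
        }
        if let Some((_, rows)) = self.current.as_mut() {
            rows.push(row);
        }
        Ok(())
    }

    fn close_current(&mut self) {
        if let Some((partition, rows)) = self.current.take() {
            self.finished.push((partition.clone(), rows.len()));
            self.closed.insert(partition);
        }
    }

    /// Returns (partition, row count) in the order partitions were closed.
    pub fn close(mut self) -> Vec<(String, usize)> {
        self.close_current();
        self.finished
    }
}
```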

liurenjie1024

Thanks @CTTY for this pr. I think we are on the right track.

liurenjie1024 · 2025-09-22T09:51:48Z

crates/iceberg/src/writer/base_writer/data_file_writer.rs

-    partition_value: Struct,
-    partition_spec_id: i32,
+pub struct DataFileWriter<B: FileWriterBuilder, L: LocationGenerator, F: FileNameGenerator> {
+    inner_writer: Option<RollingFileWriter<B, L, F>>,


This doesn't need to be Option?

Yeah, I agree. The issue is that we need to take the rolling writer out when closing the DataFileWriter, and I haven't found a nice way to do that without Option. I also don't think it makes much sense to implement Default for RollingFileWriter.
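A small sketch of the constraint being described, assuming the outer `close` only receives `&mut self` while the inner writer's `close` consumes `self`. The types are reduced stand-ins (file names in a `Vec<String>`, `String` errors), and `Default` is derived on the stand-in `RollingFileWriter` only so the example is easy to construct.

```rust
/// Reduced stand-in: closing consumes the writer.
#[derive(Default)]
pub struct RollingFileWriter {
    files: Vec<String>,
}

impl RollingFileWriter {
    pub fn write(&mut self, name: &str) {
        self.files.push(name.to_string());
    }
    pub fn close(self) -> Vec<String> {
        self.files
    }
}

/// Reduced stand-in for the outer writer holding the inner one in an `Option`.
pub struct DataFileWriter {
    inner_writer: Option<RollingFileWriter>,
}

impl DataFileWriter {
    pub fn new(inner: RollingFileWriter) -> Self {
        Self { inner_writer: Some(inner) }
    }

    pub fn write(&mut self, name: &str) -> Result<(), String> {
        match self.inner_writer.as_mut() {
            Some(writer) => {
                writer.write(name);
                Ok(())
            }
            None => Err("writer has been closed".to_string()),
        }
    }

    /// With only `&mut self` available, `Option::take` moves the inner writer
    /// out so that its consuming `close` can be called.
    pub fn close(&mut self) -> Result<Vec<String>, String> {
        match self.inner_writer.take() {
            Some(writer) => Ok(writer.close()),
            None => Err("writer has been closed".to_string()),
        }
    }
}
```

Without the `Option`, moving the field out of `&mut self` would require leaving some replacement value behind (for example via `std::mem::take`, which needs `Default`), which is the trade-off mentioned above.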

liurenjie1024 · 2025-09-22T09:54:50Z

crates/iceberg/src/writer/partitioning/mod.rs

+use crate::writer::{DefaultInput, DefaultOutput};
+
+#[async_trait::async_trait]
+pub trait PartitioningWriterBuilder<I = DefaultInput, O = DefaultOutput>:


I don't think we need a builder for partition writer?

ZENOTME reviewed Sep 7, 2025

CTTY force-pushed the ctty/idk-partition branch from ad66fa5 to ac264fc on September 9, 2025 18:23

CTTY mentioned this pull request Sep 9, 2025

Decouple ParquetWriter and LocationGenerator #1650

Open

CTTY commented Sep 9, 2025

liurenjie1024 reviewed Sep 18, 2025

CTTY added 2 commits September 18, 2025 23:44

partitionhead

1d88be2

little clean up and add partitioning writer traits

2ac588f

CTTY force-pushed the ctty/idk-partition branch from ac264fc to 2ac588f on September 21, 2025 03:26

CTTY commented Sep 21, 2025

CTTY requested a review from liurenjie1024 September 21, 2025 03:37

liurenjie1024 reviewed Sep 22, 2025

some cleanup

d887733
