Skip to content

Conversation

pan3793
Copy link
Member

@pan3793 pan3793 commented Aug 28, 2025

What changes were proposed in this pull request?

Parquet Java 1.16.0 Release Notes: https://github.com/apache/parquet-java/releases/tag/apache-parquet-1.16.0

Why are the changes needed?

Keep Parquet update to date, benefit from upstream bugfixes and improvements.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GHA.

Run TPC-DS 300G (query time in seconds), no surprise compared to Parquet 1.15.2

query_name parquet 1.15.2 parquet 1.16.0 RC2
q01.sql 31 29
q02.sql 26 20
q03.sql 9 9
q04.sql 67 67
q05.sql 14 14
q06.sql 8 8
q07.sql 5 5
q08.sql 6 6
q09.sql 13 12
q10.sql 6 6
q11.sql 31 31
q12.sql 4 3
q13.sql 5 5
q14a.sql 52 53
q14b.sql 41 41
q15.sql 4 4
q16.sql 16 16
q17.sql 7 7
q18.sql 6 6
q19.sql 6 6
q20.sql 3 3
q21.sql 5 4
q22.sql 10 10
q23a.sql 76 74
q23b.sql 80 80
q24a.sql 39 39
q24b.sql 32 31
q25.sql 7 7
q26.sql 4 3
q27.sql 3 3
q28.sql 17 16
q29.sql 9 9
q30.sql 9 9
q31.sql 13 13
q32.sql 2 2
q33.sql 5 6
q34.sql 5 5
q35.sql 13 13
q36.sql 4 4
q37.sql 7 6
q38.sql 13 12
q39a.sql 7 7
q39b.sql 7 6
q40.sql 5 6
q41.sql 0 0
q42.sql 1 2
q43.sql 4 4
q44.sql 7 7
q45.sql 3 3
q46.sql 7 7
q47.sql 11 12
q48.sql 5 5
q49.sql 7 7
q50.sql 16 16
q51.sql 15 14
q52.sql 2 2
q53.sql 2 2
q54.sql 4 5
q55.sql 1 2
q56.sql 4 4
q57.sql 12 9
q58.sql 4 6
q59.sql 14 10
q60.sql 7 7
q61.sql 4 3
q62.sql 4 5
q63.sql 2 2
q64.sql 21 21
q65.sql 12 12
q66.sql 6 7
q67.sql 46 45
q68.sql 7 6
q69.sql 10 10
q70.sql 5 5
q71.sql 6 6
q72.sql 17 16
q73.sql 4 4
q74.sql 24 24
q75.sql 24 21
q76.sql 12 10
q77.sql 6 5
q78.sql 31 31
q79.sql 5 5
q80.sql 9 8
q81.sql 8 7
q82.sql 11 11
q83.sql 4 4
q84.sql 5 4
q85.sql 8 7
q86.sql 3 3
q87.sql 14 13
q88.sql 13 13
q89.sql 4 3
q90.sql 4 3
q91.sql 3 4
q92.sql 2 2
q93.sql 21 20
q94.sql 9 9
q95.sql 62 61
q96.sql 5 3
q97.sql 12 11
q98.sql 4 3
q99.sql 6 6

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the BUILD label Aug 28, 2025
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you always, @pan3793 .

@Fokko
Copy link
Contributor

Fokko commented Aug 28, 2025

You beat me to it @pan3793 Thanks for creating this PR 🚀

@pan3793
Copy link
Member Author

pan3793 commented Aug 28, 2025

@Fokko, I'm looking forward to this release for a long time, since it includes my two patches required by SPARK-52011(#50765) 😄

@pan3793
Copy link
Member Author

pan3793 commented Aug 28, 2025

@dongjoon-hyun @Fokko, this release contains a correctness fix, it affects the Parquet 1.15.2 used by Spark branch-4.0.

It seems that the Parquet community does not have a plan for another release of 1.15.x yet, not sure if this will be a blocker for the upcoming Spark 4.0.1

@dongjoon-hyun
Copy link
Member

Thank you for informing me, @pan3793 .

@pan3793
Copy link
Member Author

pan3793 commented Aug 28, 2025

One test failed due to the Parquet data file size change, it's not a real issue, I opened #52168 to improve the test.

@dongjoon-hyun
Copy link
Member

I took a look at #52168, but we had better update the expectSize instead of removing a test coverage, @pan3793 .

@dongjoon-hyun
Copy link
Member

Thank you so much always for keeping tracking the upstream RCs, @pan3793 .

As a release manager of Apache Spark 4.0.1, as of now, I don't think this is a blocker for Apache Spark 4.0.1 release because

  • This is a feature release instead of a maintenance release of Parquet (as you mentioned).
  • There is no official Parquet release (still RC3?).
  • After we get the official 1.16.0 release, we still need to spend more efforts to validate more with Apache Spark 4.1.0 (master branch) for a while because Apache Parquet is the default file format of Apache Spark 4.1.0 first.

Of course, after we merge this to master branch, all downstream forks can cherry-pick and use Apache Parquet 1.16.0 at their own risks in their production environment. That's the main purpose to valid and merge this PR to master branch. Personally, I'm also going to verify with internal benchmarks and more production usages if Apache Parquet 1.16.0 RC3 passes the vote.

I'm sure that we agree that the best case for all and world-wide community, Apache Parquet 1.15.3 is released for Apache Spark 4.0.2 in next 3 months.

@pan3793
Copy link
Member Author

pan3793 commented Sep 1, 2025

@dongjoon-hyun, I agree with your decision, please continue the 4.0.1 release with the existing Parquet 1.15.2

@dongjoon-hyun
Copy link
Member

Thank you so much again.

@pan3793
Copy link
Member Author

pan3793 commented Sep 2, 2025

Have run TPC-DS 300G internally, no regression found, I will vote +1 for Parquet 1.16.0 RC2.

@dongjoon-hyun
Copy link
Member

Have run TPC-DS 300G internally, no regression found, I will vote +1 for Parquet 1.16.0 RC2.

Thank you for adding the result. Could you elaborate about the environment a little more? Is it against master branch with Apache Spark 4.1.0-SNAPSHOT and Hadoop 3.4.2 and S3 data (or local files) on a single disk, @pan3793 ?

@pan3793
Copy link
Member Author

pan3793 commented Sep 2, 2025

Is it against master branch with Apache Spark 4.1.0-SNAPSHOT

Exactly, I built OSS Spark without any changes. (so Spark uses Hadoop 3.4.2 client)

... and Hadoop 3.4.2 and S3 data (or local files) on a single disk

Runs in a small YARN cluster, data is stored in HDFS. YARN/HDFS server version is 3.3.6, both shuffle and HDFS use HDD.

@dongjoon-hyun
Copy link
Member

Thank you so much for the details.

@pan3793 pan3793 marked this pull request as ready for review September 3, 2025 03:06
@pan3793
Copy link
Member Author

pan3793 commented Sep 3, 2025

@dongjoon-hyun this should be ready to go. also cc @wangyum

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM for Apache Spark 4.1.0.

Thank you, @pan3793 and @wangyum .

Let's merge this to master branch in order to get more chances of community testings.

@LuciferYang
Copy link
Contributor

late LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants