[SPARK-53420][BUILD] Upgrade Parquet to 1.16.0 #52165
Thank you always, @pan3793.
You beat me to it, @pan3793. Thanks for creating this PR 🚀
@Fokko, I've been looking forward to this release for a long time, since it includes my two patches required by SPARK-52011 (#50765) 😄
@dongjoon-hyun @Fokko, this release contains a correctness fix that affects the Parquet 1.15.2 used by Spark branch-4.0. It seems that the Parquet community does not yet have a plan for another 1.15.x release, so I'm not sure whether this will be a blocker for the upcoming Spark 4.0.1.
Thank you for informing me, @pan3793.
One test failed due to a change in the Parquet data file size. It's not a real issue; I opened #52168 to improve the test.
Force-pushed from 3f8bb45 to 1fb7939.
Thank you so much, as always, for keeping track of the upstream RCs, @pan3793. As the release manager of Apache Spark 4.0.1, as of now, I don't think this is a blocker for the Apache Spark 4.0.1 release because
Of course, after we merge this to I'm sure we agree that the best case for the whole worldwide community is for Apache Parquet 1.15.3 to be released for Apache Spark 4.0.2 in the next 3 months.
@dongjoon-hyun, I agree with your decision. Please continue the 4.0.1 release with the existing Parquet 1.15.2.
Thank you so much again.
I ran TPC-DS 300G internally and found no regression, so I will vote +1 for Parquet 1.16.0 RC2.
Thank you for adding the result. Could you elaborate on the environment a little more? Is it against
Exactly, I built OSS Spark without any changes (so Spark uses the Hadoop 3.4.2 client).
It runs in a small YARN cluster, with data stored in HDFS. The YARN/HDFS server version is 3.3.6, and both shuffle and HDFS use HDD.
Thank you so much for the details.
@dongjoon-hyun, this should be ready to go. Also cc @wangyum.
Late LGTM.
What changes were proposed in this pull request?
Parquet Java 1.16.0 Release Notes: https://github.com/apache/parquet-java/releases/tag/apache-parquet-1.16.0
Why are the changes needed?
Keep Parquet up to date and benefit from upstream bug fixes and improvements.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass GHA.
Ran TPC-DS 300G (query time in seconds); no surprises compared to Parquet 1.15.2.
Was this patch authored or co-authored using generative AI tooling?
No.