[SPARK-49386][CORE][SQL][FOLLOWUP] More accurate memory tracking for memory based spill threshold #52190

cloud-fan · 2025-09-01T16:43:46Z

What changes were proposed in this pull request?

This is a followup of #47856. It makes the memory tracking more accurate in several places:

In ShuffleExternalSorter/UnsafeExternalSorter, memory is used by both the sorter itself and its underlying in-memory sorter (for sorting shuffle partition ids). We need to add them up to calculate the current memory usage.
In ExternalAppendOnlyUnsafeRowArray, the records are inserted into an in-memory buffer first. If the buffer gets too large (currently based on the number of records), we switch to UnsafeExternalSorter. The in-memory buffer also needs a memory-based threshold.
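
The first point can be illustrated with a minimal sketch. It is not the Spark code: `SpillDecisionSketch`, `ExternalSorterStub`, `InMemorySorterStub`, and all of their methods are hypothetical stand-ins, and the `>=` comparison is an assumption. It only shows that the spill check should compare the threshold against the sum of the sorter's own pages and the in-memory sorter's pointer array.

```java
// Hypothetical sketch; these are NOT the actual Spark classes or signatures.
final class SpillDecisionSketch {

    /** Stand-in for the in-memory sorter that owns the pointer array. */
    static final class InMemorySorterStub {
        private long pointerArrayBytes;

        void grow(long bytes) { pointerArrayBytes += bytes; }

        long getMemoryUsage() { return pointerArrayBytes; }
    }

    /** Stand-in for an external sorter that owns data pages plus an in-memory sorter. */
    static final class ExternalSorterStub {
        private final InMemorySorterStub inMemSorter = new InMemorySorterStub();
        private final long sizeInBytesSpillThreshold; // assumed to come from a config
        private long dataPageBytes;

        ExternalSorterStub(long sizeInBytesSpillThreshold) {
            this.sizeInBytesSpillThreshold = sizeInBytesSpillThreshold;
        }

        void allocateDataPage(long bytes) { dataPageBytes += bytes; }

        void growPointerArray(long bytes) { inMemSorter.grow(bytes); }

        /** Total usage = pages held by the sorter + memory held by the in-memory sorter. */
        long getMemoryUsage() {
            return dataPageBytes + inMemSorter.getMemoryUsage();
        }

        /** Spill once the combined usage reaches the threshold (comparison is assumed). */
        boolean shouldSpill() {
            return getMemoryUsage() >= sizeInBytesSpillThreshold;
        }
    }

    public static void main(String[] args) {
        ExternalSorterStub sorter = new ExternalSorterStub(100L);
        sorter.allocateDataPage(60L);
        System.out.println("pages only, spill? " + sorter.shouldSpill());
        sorter.growPointerArray(40L);
        System.out.println("pages + pointer array, spill? " + sorter.shouldSpill());
    }
}
```

Counting only the data pages would under-report usage in this sketch and delay the spill until the pages alone reached the threshold.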

Why are the changes needed?

More accurate memory tracking results in better spill decisions.

Does this PR introduce any user-facing change?

No, the feature is not released yet.

How was this patch tested?

existing tests

Was this patch authored or co-authored using generative AI tooling?

no

cloud-fan · 2025-09-01T16:44:33Z

cc @cxzl25 @attilapiros @mridulm

cloud-fan · 2025-09-01T16:48:02Z

...src/main/scala/org/apache/spark/sql/execution/python/ArrowWindowPythonEvaluatorFactory.scala

@@ -149,7 +149,7 @@ class ArrowWindowPythonEvaluatorFactory(

    private val inMemoryThreshold = conf.windowExecBufferInMemoryThreshold
    private val spillThreshold = conf.windowExecBufferSpillThreshold
-    private val spillSizeThreshold = conf.windowExecBufferSpillSizeThreshold
+    private val sizeInBytesSpillThreshold = conf.windowExecBufferSpillSizeThreshold


For now I reuse the spill threshold config for the ExternalAppendOnlyUnsafeRowArray in-memory buffer threshold as well. We can add new configs if we want (but it will be a lot of configs).
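
As a rough illustration of what reusing one size threshold for the in-memory buffer could look like, here is a minimal sketch. `RowBufferSketch` and every name in it are hypothetical, plain `byte[]` rows and a second list stand in for the real row type and for UnsafeExternalSorter, and the exact comparisons are assumptions rather than the behavior of ExternalAppendOnlyUnsafeRowArray.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; NOT the actual ExternalAppendOnlyUnsafeRowArray implementation.
final class RowBufferSketch {
    private final int numRowsInMemoryThreshold;   // existing record-count threshold
    private final long sizeInBytesThreshold;      // assumed: reused size-based spill threshold
    private final List<byte[]> inMemoryBuffer = new ArrayList<>();
    private final List<byte[]> spillableStore = new ArrayList<>(); // stand-in for the external sorter
    private long inMemoryBytes = 0L;
    private boolean usingSpillableStore = false;

    RowBufferSketch(int numRowsInMemoryThreshold, long sizeInBytesThreshold) {
        this.numRowsInMemoryThreshold = numRowsInMemoryThreshold;
        this.sizeInBytesThreshold = sizeInBytesThreshold;
    }

    void add(byte[] row) {
        boolean fitsInMemory = !usingSpillableStore
            && inMemoryBuffer.size() < numRowsInMemoryThreshold
            && inMemoryBytes < sizeInBytesThreshold;
        if (fitsInMemory) {
            inMemoryBuffer.add(row);
            inMemoryBytes += row.length;
            return;
        }
        if (!usingSpillableStore) {
            // Either threshold was reached: move buffered rows to the spillable store.
            spillableStore.addAll(inMemoryBuffer);
            inMemoryBuffer.clear();
            inMemoryBytes = 0L;
            usingSpillableStore = true;
        }
        spillableStore.add(row);
    }

    boolean usesSpillableStore() { return usingSpillableStore; }

    int length() { return inMemoryBuffer.size() + spillableStore.size(); }

    public static void main(String[] args) {
        // Large row-count threshold, small byte threshold: the size check triggers the switch.
        RowBufferSketch buffer = new RowBufferSketch(100, 8L);
        for (int i = 0; i < 3; i++) {
            buffer.add(new byte[4]);
        }
        System.out.println("switched: " + buffer.usesSpillableStore() + ", rows: " + buffer.length());
    }
}
```

In this sketch a few very large rows move the buffer to the spillable store even though the record-count threshold is far from reached, which is the case a count-only check misses.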

core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java

…alSorter.java

More accurate memory tracking for memory based spill threshold

b586774

github-actions bot added SQL CORE PYTHON labels Sep 1, 2025

Update core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExtern…

350754b

…alSorter.java

Update core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExtern…

4920958

…alSorter.java
