cloud-fan
Contributor

What changes were proposed in this pull request?

This is a follow-up to #47856. It makes memory tracking more accurate in several places:

  1. In ShuffleExternalSorter/UnsafeExternalSorter, memory is used by both the sorter itself and its underlying in-memory sorter (for sorting shuffle partition ids). We need to add them up to calculate the current memory usage.
  2. In ExternalAppendOnlyUnsafeRowArray, records are inserted into an in-memory buffer first. If the buffer gets too large (currently based on the number of records), we switch to UnsafeExternalSorter. The in-memory buffer also needs a memory-based threshold.
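
As a rough illustration of item 1, the sketch below adds the memory held by a sorter's own pages to the memory held by its in-memory sorter before comparing against a spill threshold. The class and method names (`MemoryUsageSketch`, `ExternalSorterSketch`, `InMemorySorterSketch`, `shouldSpill`) are hypothetical stand-ins and are not the actual Spark classes or the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; these are not Spark's real classes.
final class MemoryUsageSketch {

    // Stand-in for the in-memory sorter that holds the pointer array.
    static final class InMemorySorterSketch {
        private final long pointerArrayBytes;

        InMemorySorterSketch(long pointerArrayBytes) {
            this.pointerArrayBytes = pointerArrayBytes;
        }

        long getMemoryUsage() {
            return pointerArrayBytes;
        }
    }

    // Stand-in for an external sorter that owns data pages plus an in-memory sorter.
    static final class ExternalSorterSketch {
        private final List<Long> allocatedPageSizes = new ArrayList<>();
        private final InMemorySorterSketch inMemSorter;

        ExternalSorterSketch(InMemorySorterSketch inMemSorter) {
            this.inMemSorter = inMemSorter;
        }

        void allocatePage(long bytes) {
            allocatedPageSizes.add(bytes);
        }

        // Total usage = pages owned by this sorter + memory of the in-memory sorter.
        long getMemoryUsage() {
            long pageBytes = 0L;
            for (long size : allocatedPageSizes) {
                pageBytes += size;
            }
            long inMemBytes = (inMemSorter == null) ? 0L : inMemSorter.getMemoryUsage();
            return pageBytes + inMemBytes;
        }

        // Spill decision based on the combined figure (assumed ">=" comparison).
        boolean shouldSpill(long sizeInBytesThreshold) {
            return getMemoryUsage() >= sizeInBytesThreshold;
        }
    }
}
```

Counting only the pages would under-report usage by the size of the pointer array, so a size-based spill could trigger later than intended.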

Why are the changes needed?

More accurate memory tracking leads to better spill decisions.

Does this PR introduce any user-facing change?

No, the feature is not released yet.

How was this patch tested?

existing tests

Was this patch authored or co-authored using generative AI tooling?

no

@cloud-fan
Contributor Author

cc @cxzl25 @attilapiros @mridulm

@@ -149,7 +149,7 @@ class ArrowWindowPythonEvaluatorFactory(

 private val inMemoryThreshold = conf.windowExecBufferInMemoryThreshold
 private val spillThreshold = conf.windowExecBufferSpillThreshold
-private val spillSizeThreshold = conf.windowExecBufferSpillSizeThreshold
+private val sizeInBytesSpillThreshold = conf.windowExecBufferSpillSizeThreshold
Contributor Author

For now I reuse the spill threshold config for the ExternalAppendOnlyUnsafeRowArray in-memory buffer threshold as well. We can add new configs if we want (but it will be a lot of configs).
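
To make the idea concrete, here is a minimal sketch of a row buffer that stays in memory until either a row-count threshold or a size-in-bytes threshold is reached, then moves everything to an external store. All names (`RowBufferSketch`, `numRowsInMemoryThreshold`, `sizeInBytesThreshold`, `externalRows`) are hypothetical, the external sorter is replaced by a plain list, and the exact comparison used in the PR may differ. Under the reuse described above, the caller would pass the same size value it uses for the sorter's spill threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; not the actual ExternalAppendOnlyUnsafeRowArray.
final class RowBufferSketch {
    private final int numRowsInMemoryThreshold;
    private final long sizeInBytesThreshold;

    private final List<byte[]> inMemoryBuffer = new ArrayList<>();
    private long inMemoryBytes = 0L;

    // Plain list standing in for the external sorter.
    private final List<byte[]> externalRows = new ArrayList<>();
    private boolean usingExternalSorter = false;

    RowBufferSketch(int numRowsInMemoryThreshold, long sizeInBytesThreshold) {
        this.numRowsInMemoryThreshold = numRowsInMemoryThreshold;
        this.sizeInBytesThreshold = sizeInBytesThreshold;
    }

    void add(byte[] row) {
        boolean underRowLimit = inMemoryBuffer.size() < numRowsInMemoryThreshold;
        boolean underSizeLimit = inMemoryBytes < sizeInBytesThreshold;
        if (!usingExternalSorter && underRowLimit && underSizeLimit) {
            inMemoryBuffer.add(row);
            inMemoryBytes += row.length;
            return;
        }
        if (!usingExternalSorter) {
            // First time a limit is hit: move buffered rows to the external store.
            externalRows.addAll(inMemoryBuffer);
            inMemoryBuffer.clear();
            inMemoryBytes = 0L;
            usingExternalSorter = true;
        }
        externalRows.add(row);
    }

    boolean isUsingExternalSorter() {
        return usingExternalSorter;
    }

    int inMemoryRowCount() {
        return inMemoryBuffer.size();
    }

    int externalRowCount() {
        return externalRows.size();
    }
}
```

With only a row-count check, a handful of very wide rows could hold a large amount of memory without ever triggering the switch; the size check closes that gap.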
