Skip to content

Conversation

nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Sep 2, 2025

We already publish the 50th, 90th and 99th percentile queue latencies for time tracking thread pools, this PR adds a metric that publishes the most recent max queue latency observed by org.elasticsearch.action.admin.cluster.node.usage.TransportNodeUsageStatsForThreadPoolsAction

I added a new metrics class to make it available to the action, I'm open to the idea of putting these metrics somewhere else, it didn't seem like we had an obvious home for them currently. There is DesiredBalanceMetrics but that strikes me as being more focused on the balancer rather than the deciders.

Relates: ES-12631

@nicktindall nicktindall added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Sep 2, 2025
@nicktindall nicktindall marked this pull request as ready for review September 2, 2025 06:28
@nicktindall nicktindall requested a review from a team as a code owner September 2, 2025 06:28
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Sep 2, 2025
Comment on lines 292 to 293
safeAwait(delayLatch);
safeAwait(delayLatch);
Copy link
Contributor

@mhl-b mhl-b Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need a barrier? I think countdown is enough to block thread-pool and let one task sit in queue.

        var latch = new CountDownLatch(1);
        range(0, threads + 1).forEach(i -> pool.execute(() -> safeAwait(latch)));
        var queueTimeMs = between(100,200);
        safeSleep(queueTimeMs);
        latch.countDown();

@mhl-b
Copy link
Contributor

mhl-b commented Sep 2, 2025

There is DesiredBalanceMetrics but that strikes me as being more focused on the balancer rather than the deciders.

I don't mind having queue metrics in DesiredBalanceMetrics, at least, everything related to allocation in one place. Also allocators and balancer tend to use same metrics.

@nicktindall nicktindall requested a review from Copilot September 3, 2025 02:56
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a new metric that publishes the maximum queue latency for the write thread pool to support write load constraint decisions. The metric is collected by TransportNodeUsageStatsForThreadPoolsAction and exposed through the existing DesiredBalanceMetrics system.

  • Refactors DesiredBalanceMetrics to be injected as a dependency instead of created internally
  • Adds max queue latency metric registration and collection functionality
  • Updates all test constructors to use the new DesiredBalanceMetrics.NOOP instance

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
DesiredBalanceMetrics.java Adds NOOP instance and method to register write load decider max latency gauge
DesiredBalanceShardsAllocator.java Changes constructor to accept DesiredBalanceMetrics instead of creating internally
ClusterModule.java Creates and injects DesiredBalanceMetrics instance
TransportNodeUsageStatsForThreadPoolsAction.java Adds max queue latency metric collection and registration
WriteLoadConstraintDeciderIT.java Adds integration test for the new metric
Test files Updates all test constructors to use DesiredBalanceMetrics.NOOP

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

…max_queue_latency

# Conflicts:
#	server/src/internalClusterTest/java/org/elasticsearch/cluster/routing/allocation/decider/WriteLoadConstraintDeciderIT.java
@nicktindall nicktindall requested a review from mhl-b September 3, 2025 03:38
@nicktindall nicktindall force-pushed the ES-12631_add_metrics_max_queue_latency branch from 8c35a43 to 5f4cf5f Compare September 4, 2025 03:41
Copy link
Contributor

@mhl-b mhl-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nicktindall nicktindall merged commit 880d52f into elastic:main Sep 4, 2025
33 checks passed
@nicktindall nicktindall deleted the ES-12631_add_metrics_max_queue_latency branch September 4, 2025 23:35
jbaiera pushed a commit to jbaiera/elasticsearch that referenced this pull request Sep 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants