
Conversation

@ywangd (Member) commented Sep 2, 2025:

This PR adds allocation explain output to the logs when there are unassigned shards after a converged DesiredBalance computation. The allocation explain prefers an unassigned primary over a replica. The logging keeps track of one unassigned shard and does not log again until it becomes assigned.

Relates: ES-12797
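As a rough illustration of the selection behaviour described above, here is a minimal sketch of the primary-over-replica preference; `UnassignedShard` and `pickShardToExplain` are hypothetical names for illustration, not the PR's actual API:

```java
import java.util.List;

// Hypothetical sketch: pick one unassigned shard to run allocation explain on,
// preferring a primary over a replica, as the PR description says.
record UnassignedShard(String index, int shardId, boolean primary) {}

class ShardToExplainPicker {
    /** Returns the unassigned shard to explain, or null when nothing is unassigned. */
    static UnassignedShard pickShardToExplain(List<UnassignedShard> unassigned) {
        UnassignedShard fallback = null;
        for (UnassignedShard shard : unassigned) {
            if (shard.primary()) {
                return shard; // an unassigned primary always wins over a replica
            }
            if (fallback == null) {
                fallback = shard; // remember the first replica in case no primary turns up
            }
        }
        return fallback;
    }
}
```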

@ywangd ywangd requested a review from DaveCTurner September 2, 2025 06:18
@ywangd ywangd requested a review from a team as a code owner September 2, 2025 06:18
@ywangd ywangd added the :Distributed Coordination/Allocation and v9.2.0 labels Sep 2, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination label Sep 2, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd (Member Author) commented Sep 2, 2025:

This PR adds a logging message similar to the following for desired balance computation.

```
unassigned shard [[test-index][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=INDEX_CREATED], at[2025-09-02T06:00:07.478Z], delayed=false, allocation_status[no_attempt]], failed_attempts[0]] with allocation decision {"node_allocation_decision":{"can_allocate":"no","allocate_explanation":"Elasticsearch isn't allowed to allocate this shard to any of the nodes in the cluster. Choose a node to which you expect this shard to be allocated, find this node in the node-by-node explanation, and address the reasons which prevent Elasticsearch from allocating this shard there.","node_allocation_decisions":[{"node_id":"node-0","node_name":"","transport_address":"0.0.0.0:5","roles":["data"],"node_decision":"no","weight_ranking":1,"deciders":[{"decider":"same_shard","decision":"NO","explanation":"a copy of this shard is already allocated to this node [[test-index][0], node[node-0], [P], s[STARTED], a[id=uwInxw8VQf-7Vh-s6s15XA], failed_attempts[0], expected_shard_size[0]]"}]}]}}
```

It is inspired by the investigation of ES-12797. Although that specific investigation has concluded and no longer needs it, this feels like generally useful information for future investigations, hence this PR.

In ES-12797, it was suggested to log something similar in DesiredBalanceReconciler. I am not sure whether that is still necessary; my impression is that unassigned shards are usually already unassigned in the DesiredBalance. In case we do still want a similar message for the reconciler, I think a separate follow-up PR is more reasonable.

@ywangd ywangd removed the request for review from a team September 2, 2025 06:25
@DaveCTurner (Contributor) left a comment:

Looks ok as is, I just left a question about an alternative design.

```java
@@ -98,6 +99,10 @@ public OptionalDouble getForecastedWriteLoad(IndexMetadata indexMetadata) {
    public void refreshLicense() {}
};

public static final DesiredBalanceShardsAllocator.ShardAllocationExplainer DUMMY_EXPLAINER = (
```
Contributor:

naming nit: maybe TEST_EXPLAINER or TEST_ONLY_EXPLAINER or something like that?

Member Author:
Sure, pushed e1b6207

```java
@@ -77,15 +86,27 @@ public class DesiredBalanceComputer {
    private long lastConvergedTimeMillis;
    private long lastNotConvergedLogMessageTimeMillis;
    private Level convergenceLogMsgLevel;
    private final FrequencyCappedAction logAllocationExplainForUnassigned;
```
Contributor:
Rather than a simple frequency cap, WDYT about keeping track of the shard we logged and not logging anything while that same shard remains unassigned? I.e. we reset the state only when the unassigned shard becomes assigned or is deleted. Otherwise, if there are multiple unassigned shards, we may get reports about a different shard each time, which will be very hard to interpret.

Might also be nice to record the reset in the logs; that way we can see how long a shard might be blocked from assignment.
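A minimal sketch of this suggested stateful approach, with hypothetical names (this is not the code later pushed in 9a4e07f):

```java
// Hypothetical sketch: remember the shard we explained, stay quiet while it
// remains unassigned, and record a reset when it becomes assigned or is deleted
// so the logs show how long the shard was blocked from assignment.
class UnassignedShardLogTracker {
    private String trackedShardId; // e.g. "[test-index][0]"; null means nothing tracked

    /** Returns true when a fresh explanation should be logged for this shard. */
    synchronized boolean shouldLogExplanation(String unassignedShardId) {
        if (trackedShardId == null) {
            trackedShardId = unassignedShardId;
            return true; // start tracking and emit one explanation
        }
        return false; // an earlier report is still outstanding; suppress further ones
    }

    /** Called when the shard becomes assigned or its index is deleted. */
    synchronized void onShardResolved(String shardId, long unassignedMillis) {
        if (shardId.equals(trackedShardId)) {
            trackedShardId = null; // reset so the next unassigned shard gets reported
            System.out.printf("shard %s resolved after %d ms unassigned%n", shardId, unassignedMillis);
        }
    }
}
```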

Member Author:

Sounds like a good idea. I pushed 9a4e07f for it.

@ywangd ywangd requested a review from DaveCTurner September 3, 2025 03:22
@DaveCTurner (Contributor) left a comment:

LGTM, I like it.

```java
    RoutingNodes routingNodes,
    RoutingAllocation routingAllocation
) {
    if (allocationExplainLogger.isDebugEnabled()) {
```
@DiannaHohensee (Contributor) commented Sep 3, 2025:

Is there a plan to temporarily turn debug on for the new logger in serverless? Turning it on after the fact, during an incident, won't be useful, I expect.

I'd be inclined to log at INFO. CONVERGED with unassigned shards shouldn't happen (except for our mystery bug), unless decider settings are too strict (not expected in serverless), and in that case the same shard will stay unassigned and shouldn't flood the logs.

Member Author:

Yes, the plan is to enable the debug log in serverless; I raised ES-12842 for it. I prefer the DEBUG level since it affects stateful as well.
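For context on the `isDebugEnabled()` guard in the diff above: computing an allocation explanation is comparatively expensive, so it should only run when the dedicated logger will actually emit it. A minimal sketch of that pattern, assuming an illustrative logger name and helper method (not the PR's actual code):

```java
import java.util.function.Supplier;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Illustrative sketch: only build the expensive allocation explanation when
// DEBUG is actually enabled for the dedicated logger.
class AllocationExplainLogging {
    private static final Logger allocationExplainLogger = LogManager.getLogger("allocation_explain");

    static void maybeLogExplanation(Supplier<String> expensiveExplain) {
        if (allocationExplainLogger.isDebugEnabled()) {
            // the Supplier defers the costly explain computation until we know it will be emitted
            allocationExplainLogger.debug("unassigned shard allocation explanation: {}", expensiveExplain.get());
        }
    }
}
```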

@ywangd (Member Author) commented Sep 3, 2025:

@elasticmachine update branch

@ywangd ywangd added the auto-merge-without-approval label Sep 3, 2025
@elasticsearchmachine elasticsearchmachine merged commit 99ff870 into elastic:main Sep 4, 2025
33 checks passed
@ywangd ywangd deleted the log-allocation-explain-for-unassigned branch September 4, 2025 00:55
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Sep 11, 2025