-
Notifications
You must be signed in to change notification settings - Fork 25.4k
Log allocation explain for unassigned in desiredBalance computaion #133958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Log allocation explain for unassigned in desiredBalance computaion #133958
Conversation
This PR adds allocaiton explain to logs when there are unassigned shards after a converged DesiredBalance computation. The allocation explain prefers unassigned primary over replica. The logging is frequency capped at one minute by default. Relates: ES-12797
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
This PR adds a logging message similar to the following for desired balance computation.
It is inspired by the investigation of ES-12797. Though the specific investigation has concluded and no longer needs it. It feels like a generally useful information for future investigations. Hence this PR. In ES-12797, it was suggested to log something similar in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok as is, I just left a question about an alternative design.
@@ -98,6 +99,10 @@ public OptionalDouble getForecastedWriteLoad(IndexMetadata indexMetadata) { | |||
public void refreshLicense() {} | |||
}; | |||
|
|||
public static final DesiredBalanceShardsAllocator.ShardAllocationExplainer DUMMY_EXPLAINER = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming nit: maybe TEST_EXPLAINER
or TEST_ONLY_EXPLAINER
or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure pushed e1b6207
@@ -77,15 +86,27 @@ public class DesiredBalanceComputer { | |||
private long lastConvergedTimeMillis; | |||
private long lastNotConvergedLogMessageTimeMillis; | |||
private Level convergenceLogMsgLevel; | |||
private final FrequencyCappedAction logAllocationExplainForUnassigned; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than a simple frequency cap, WDYT about keeping track of the shard we logged and not logging anything while that same shard remains unassigned? I.e. we reset the state only when the unassigned shard becomes assigned or is deleted. Otherwise if there are multiple shards unassigned we may get reports about different shards each time which will be very hard to interpret.
Might also be nice to record the reset in the logs, that way we can see how long a shard might be blocked from assignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea. I pushed 9a4e07f for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM I like it
RoutingNodes routingNodes, | ||
RoutingAllocation routingAllocation | ||
) { | ||
if (allocationExplainLogger.isDebugEnabled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a plan to temporarily turn debug on for the new logger in serverless? Turning it on after the fact in an incident won't be useful, I expect.
I'd be inclined to log at INFO. CONVERGED with unassigned shards shouldn't happen (except for our mystery bug). Unless decider settings are too strict (not expected in serverless), and then the same shard will be unassigned and shouldn't flood the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the plan is to enable the debug log in serverless. I raised ES-12842 for it. I prefer the DEBUG level since it affects stateful as well.
@elasticmachine update branch |
BASE=fabf32c3f1dd08a0e51d6b86412bbea7dee50bc7 HEAD=56dfacba4d973bd825df6f7b24a516b307b8fff7 Branch=main
BASE=fabf32c3f1dd08a0e51d6b86412bbea7dee50bc7 HEAD=56dfacba4d973bd825df6f7b24a516b307b8fff7 Branch=main
This PR adds allocaiton explain to logs when there are unassigned shards after a converged DesiredBalance computation. The allocation explain prefers unassigned primary over replica. The logging keeps track of one unassigned shard and does not log again until it becomes assigned.
Relates: ES-12797