📖 Add a design for supporting warm replicas. #3121
Conversation
## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre filled queue before it starts processing items.
As discussed offline, one way to avoid this is to use a metrics wrapper that suppresses them until the leader election is won. But not sure if it's worth bothering.
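For illustration, a rough sketch of what suppressing the depth gauge until leadership could look like; the `gatedDepth` type and its `promote` hook are hypothetical, only `workqueue.GaugeMetric` is existing API:

```go
package warmup // package name is illustrative

import (
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

// gatedDepth wraps the real depth gauge and suppresses updates until the
// replica has won leader election; the accumulated depth is flushed on promotion.
type gatedDepth struct {
	delegate workqueue.GaugeMetric // the real Prometheus-backed gauge
	leading  atomic.Bool
	pending  atomic.Int64 // depth accumulated while not leading
}

func (g *gatedDepth) Inc() {
	if g.leading.Load() {
		g.delegate.Inc()
		return
	}
	g.pending.Add(1)
}

func (g *gatedDepth) Dec() {
	if g.leading.Load() {
		g.delegate.Dec()
		return
	}
	g.pending.Add(-1)
}

// promote is called once leader election is won and replays the suppressed depth.
func (g *gatedDepth) promote() {
	g.leading.Store(true)
	for n := g.pending.Swap(0); n > 0; n-- {
		g.delegate.Inc()
	}
}
```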
Does this actually break the metric? Sounds like the metric will just show the reality
It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.
> It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.
Not sure I quite agree with that. The alerts work today; if we change the behavior here, we break them. To my knowledge there also isn't a metric that indicates if a given replica is the leader, so I don't even see a good way to unbreak them.
Yeah, but the current definition of the metric is that it should show the length of the queue
> To my knowledge there also isn't a metric that indicates if a given replica is the leader
That could be solved, I hope :)
- logs via logState
- other workqueue metrics: adds_total, queue_duration_seconds
- Although I guess we can also fake these. What would happen when the controller starts? I assume we would set the length metric immediately to its correct value. Similar for adds_total and probably also queue_duration_seconds
I also think folks have programmatic access to the queue (at least by instantiating the queue in controller.Options.NewQueue)
So we don't know what kind of things folks are doing with the queue, e.g. accessing queue length or taking items out of the queue even if the leader election controllers are not running.
Then maybe we shouldn't store the items in the queue at this time, because that's observable behavior as well (not only through the metric), rather than just making it look like the queue is empty through the metric.
That makes sense to me. I guess, reading between the lines, the suggestion here is that non-leader runnables should not fill the queue. The specific behavior I want to solve for is to have the queue ready to go ASAP after leader election, so would we implement something like "populate queue from cache on winning leader election"? I am not too familiar with the details here so would appreciate some concrete nudges in the right direction.
> That makes sense to me. I guess, reading between the lines, the suggestion here is that non-leader runnables should not fill the queue. The specific behavior I want to solve for is to have the queue ready to go ASAP after leader election, so would we implement something like "populate queue from cache on winning leader election"?
I don't know how that would be possible, we would need to implement a dummy queue to buffer.
Let's punt on this problem until we make this the default behavior; we don't have to solve it right away. As long as it's opt-in and there is a metric indicating whether a given replica is the leader, it won't break anyone by default, and people can adjust their alerts accordingly.
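For reference, a minimal sketch of how such a leader metric could be exposed from user code today, using the existing `mgr.Elected()` channel and `metrics.Registry`; the metric name and helper function are made up for illustration:

```go
package main

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// leaderGauge is 1 while this replica holds the leader lease, 0 otherwise.
// The metric name is illustrative, not an established convention.
var leaderGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "controller_runtime_is_leader",
	Help: "1 if this replica currently holds the leader lease.",
})

func registerLeaderMetric(mgr manager.Manager) error {
	if err := metrics.Registry.Register(leaderGauge); err != nil {
		return err
	}
	// mgr.Elected() is closed once this replica wins leader election.
	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		select {
		case <-mgr.Elected():
			leaderGauge.Set(1)
		case <-ctx.Done():
		}
		<-ctx.Done()
		return nil
	}))
}
```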
Sounds good!
Thought more about this. Probably the initial suggestion is fine. We should just make sure that the metrics popping up once the controller becomes leader make sense
## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre filled queue before it starts processing items.
2. Ideally, non-leader runnables should block readyz and healthz checks until they are in sync. I am
This will break conversion webhooks. I don't know if there is a good way to figure out if the binary contains a conversion webhook, but if in doubt we have to retain the current behavior
Can you explain why this breaks conversion webhooks? Taking a step back, why does controller-runtime offer a way to run conversion webhooks in the first place?
> Can you explain why this breaks conversion webhooks?
Because they need to be up to sync the cache, so blocking readyz until the cache is ready creates a deadlock.
> Taking a step back, why does controller-runtime offer a way to run conversion webhooks in the first place?
Because they are part of some controllers.
Got it, we can just put the burden on the user to register readyz as they want then.
Potentially we can provide a util func like we did with mgr.GetWebhookServer().StartedChecker() to make it easier (can be done in a follow-up PR).
But agree we can't add this by default as it can lead to deadlocks for controllers that serve conversion webhooks.
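For illustration, such a helper might look roughly like this; `cacheSyncedChecker` is a hypothetical name, while `healthz.Checker`, `manager.Manager.GetCache()`, and `WaitForCacheSync` are existing API:

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/healthz"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// cacheSyncedChecker reports ready only once the manager's informer caches
// have synced. It must NOT be registered by controllers that also serve
// conversion webhooks, since the cache may need the webhook to sync (deadlock).
func cacheSyncedChecker(mgr manager.Manager) healthz.Checker {
	return func(_ *http.Request) error {
		// Probe with a short timeout so the readiness endpoint stays responsive.
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		if !mgr.GetCache().WaitForCacheSync(ctx) {
			return errors.New("informer caches have not synced yet")
		}
		return nil
	}
}

// Usage: mgr.AddReadyzCheck("cache-sync", cacheSyncedChecker(mgr))
```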
At least in #3192, this has not yet been fully implemented.
`cm.runnables.Warmup.Start()` launches a goroutine for each runnable to invoke `runnable.Start()` and then returns once all known runnables have been started. After that, the leaderElector is started. In other words, at present, there is no guaranteed ordering between leader election and the completion of cache synchronization.
The key point is that passing in the `ready` parameter is not currently implemented: `if err := r.Warmup.Add(RunnableFunc(warmupRunnable.Warmup), nil); err != nil {`. If the check only passes after the warmup is completed, and before that it blocks `cm.runnables.Warmup.Start()`, then the idea of blocking leader election would be feasible.
controller-runtime/pkg/manager/runnable_group.go, lines 246 to 250 in 9d3997b:

```go
if rn.Check(r.ctx) {
	if rn.signalReady {
		r.startReadyCh <- rn
	}
}
```
controller-runtime/pkg/manager/runnable_group.go, lines 190 to 210 in 9d3997b:

```go
// Wait for all runnables to signal.
for {
	select {
	case <-ctx.Done():
		if err := ctx.Err(); !errors.Is(err, context.Canceled) {
			retErr = err
		}
	case rn := <-r.startReadyCh:
		for i, existing := range r.startQueue {
			if existing == rn {
				// Remove the item from the start queue.
				r.startQueue = append(r.startQueue[:i], r.startQueue[i+1:]...)
				break
			}
		}
		// We're done waiting if the queue is empty, return.
		if len(r.startQueue) == 0 {
			return
		}
	}
}
```
Got it, I think you're right
> After that, the leaderElector is started. In other words, at present, there is no guaranteed ordering between leader election and the completion of cache synchronization.

Yes, this is correct, but I think your suggestion of a readiness check means that we won't even start non-leader-election runnables before the warmups have finished, which I think is undesirable. Not starting leader election until it has finished seems fine, but might slow down things in case there was no passive replica, because acquiring the leader lease takes a bit of time.
The other thing this makes me realize is that in order to have rollouts where the cache sync is not part of the downtime we would need to integrate the warmup with a healthcheck, such that the healthcheck only passes once warmup has completed, otherwise what could happen is:
- new pod n1 gets created
- Old pod o1 gets removed and gives up the leader lease
- leader lease gets acquired by old pod o2, as n1 hasn't finished syncing its cache
- new pod n2 gets created
- old pod o2 gets shut down; if neither n1 nor n2 has synced yet, we will again have some cache-sync-induced downtime
The issue then however is that delaying the healthcheck until syncing finished poses a problem for anyone who uses conversion webhooks, as conversion webhooks may be required to be able to sync, causing a deadlock.
We could probably add a second knob to control if the warmup should be part of the healthcheck and then this can be used by anyone who doesn't have conversion webhooks?
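Purely to illustrate the idea, the two knobs could be shaped roughly like this (field names are made up, not an agreed API):

```go
// Sketch only: two independent opt-ins, so that controllers serving
// conversion webhooks can warm up without wiring warmup into readiness.
type WarmupOptions struct {
	// WarmupWithoutLeaderElection starts sources/caches before the leader
	// lease is acquired.
	WarmupWithoutLeaderElection *bool

	// BlockReadyzUntilWarmedUp makes readyz fail until warmup has completed.
	// Must stay false/nil for binaries that also serve conversion webhooks,
	// because the cache may need the webhook to be ready (deadlock).
	BlockReadyzUntilWarmedUp *bool
}
```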
> Not starting leader election until it has finished seems fine, but might slow down things in case there was no passive replica, because acquiring the leader lease takes a bit of time.
I’m describing this requirement because, in our scenario, we need to strictly control downtime, and synchronizing the cache usually takes several minutes. Therefore, a delay of just a few seconds caused by leader election is entirely acceptable.
> leader lease gets acquired by old pod o2, as n1 hasn't finished syncing its cache
It still seems to assume that n1 will only attempt to acquire the lease after it has finished syncing the cache.
Warmup, healthcheck, and leader election need to be combined together in order to eliminate downtime during rollouts. If only warmup and healthcheck are used, the following situation can occur:
- new pod n1 is created, completes cache synchronization, and becomes ready.
- old pod o1 is removed and gives up the leader lease.
- new pod n2 is created, its cache has not yet finished syncing, and it starts to participate in leader election.
- the new pod n2 becomes the leader, but its cache has not yet finished syncing, leading to downtime.
2. Ideally, non-leader runnables should block readyz and healthz checks until they are in sync. I am
not sure what the best way of implementing this is, because we would have to add a healthz check
that blocks on WaitForSync for all the sources started as part of the non-leader runnables.
3. An alternative way of implementing the above is to moving the source starting / management code
That is kind of already the case: the source uses the cache (which is a runnable) to get the actual informer, and the cache is shared and started before anything else except conversion webhooks (as conversion webhooks might be needed to start it). The problem is just that the cache does not actually cache anything for resources for which no informer was requested, and that in turn only happens after the source is started, which happens post controller start and thus post leader election.
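As an aside, this is also why explicitly requesting an informer up front warms it before leader election today; a minimal sketch under that assumption, using the existing `cache.Cache.GetInformer` API:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// warmPodInformer asks the shared cache for the Pod informer up front. The
// cache is a runnable that starts regardless of leadership, so the informer is
// created and synced while the replica is still a non-leader.
func warmPodInformer(ctx context.Context, mgr manager.Manager) error {
	_, err := mgr.GetCache().GetInformer(ctx, &corev1.Pod{})
	return err
}
```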
designs/warmreplicas.md (outdated)
## Motivation
Controllers reconcile all objects during startup / leader election failover to account for changes
in the reconciliation logic. For certain sources, the time to serve the initial list can be
Just curious. Is it literally the list call that takes minutes or the subsequent reconciliation?
Does this change when the new ListWatch feature is used?
This is purely about the time it takes to start reconciling after a replica went down because of a rollout or an unforeseen event. Right now, that means we first acquire the leader lease, then sync all caches, then start reconciling. The goal of this doc is to do the second step before we even try to acquire the leader lease, as that takes the cache-sync time out of the transition time.
Agree the description could be a bit clearer.
Yeah that's fine. I was just curious about the list call :) (aka the cache sync)
downtime as even after leader election, the controller has to wait for the initial list to be served
before it can start reconciling.

## Proposal
What if there's a lot of churn and the controller doesn't become leader maybe not at all or maybe after a few days?
The queue length will increase while there is nothing that takes items out of the queue.
I know the queue doesn't require significant memory to store an item, but is there something we should be concerned about if the queue has e.g. millions of items (let's say we watch pods and we don't become leader for a month)?
Worst case, it at some point gets OOM-killed, restarts, and does everything again. I don't think this is likely to become an actual issue.
Yeah definitely not a big problem. Maybe just good to know
Doesn't the queue have deduplication built in? I thought that you couldn't re-add a key to the queue if the key already existed within the queue.
In which case, when you perform the initial list, it adds every item to the queue, and then every object created after would be added to the queue, but the queue length would always be equal to the number of those objects in the cluster
We should double check if this is actually how they work
> Doesn't the queue have deduplication built in? I thought that you couldn't re-add a key to the queue if the key already existed within the queue.
Yes there is de-duplication. We are basically talking about a map :)
> queue length would always be equal to the number of those objects in the cluster
I don't think that deleted objects are removed from our queue. So if you have churn (i.e. any objects are deleted) the queue length won't be equal to the number of objects in the cluster.
But the queue only stores namespace + name. So it doesn't require a lot of memory per object.
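To make the de-duplication concrete, a small standalone sketch using the typed workqueue that controller-runtime builds on (assuming the typed constructors from client-go are available in your dependency versions):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
	q := workqueue.NewTypedRateLimitingQueue(
		workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]())
	req := reconcile.Request{NamespacedName: types.NamespacedName{Namespace: "default", Name: "foo"}}

	q.Add(req)
	q.Add(req) // de-duplicated: the queue behaves like a set keyed by namespace/name

	fmt.Println(q.Len()) // 1
}
```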
## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre filled queue before it starts processing items.
> Does this actually break the metric? Sounds like the metric will just show the reality
> It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.
Took another look. Definitely fine for now. I would suggest we review/merge #3192 and then finalize this PR afterwards.
Do you have a POC or any implementation that would show what the difference is for an end user? Sources generally are informers, right? So, as someone implementing a controller, how would I start the informers in this case vs what I would have normally done?
A worked example of the usage would probably be good to include in the design doc
// when the manager is not the leader.
// Defaults to false, which means that the controller will wait for leader election to start
// before starting sources.
ShouldWarmupWithoutLeadership *bool
Why `*bool` over `bool`?
We usually like to use *bool so we can differentiate between default value and whatever a user configured.
This is also necessary as we have two levels of configuration ("manager" and per controller) and we have to be able to figure out if the per-controller config overrides the manager config
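A tiny sketch of the resolution logic this enables; the helper name is hypothetical:

```go
// resolveWarmup picks the per-controller override when set, otherwise the
// manager-wide default; a nil override means "not configured".
func resolveWarmup(managerDefault bool, controllerOverride *bool) bool {
	if controllerOverride != nil {
		return *controllerOverride
	}
	return managerDefault
}
```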
Has the impact on API servers been assessed for this proposal? IIUC, the slow-to-start sources are likely informers that are having to page through large volumes of resources, right? So there's already a higher-than-average load on the API server for these types of resources, and this EP is basically going to double the API server load? Am I following this correctly?
It's an opt-in feature, and in the vast majority of cases the kube-apiserver is managed. The problem this solves is that failover is currently slow due to the fact that the watches are only started after leader election is won.
Maybe additional apiserver load is worth mentioning in the godoc of the field that enables the feature?
The implementation can be found here: #3192. You only have to set I would recommend reviewing the PR as we are figuring out some details there at the moment (e.g. I raised the question whether WarmupRunnable should be exported, xref: #3192 (comment); this would drastically reduce the API surface of this feature).
Opt-in for the operator author, or opt-in for the end user who deploys that operator to their cluster? I suspect we might want to recommend that folks expose an option to end users to be able to turn this on/off on a cluster-by-cluster basis. I guess that would be exposing the
That is up to the operator author to decide. I personally don't think the duplicate watches are an issue for the apiserver; if that can make it fall over, it's too small anyway. But generally speaking, making a call as to what to do or not to do by default, and whether an end-user option should be exposed, is up to the controller author. I don't believe controller-runtime is in a good position to make a recommendation there.
Usually we don't go as far as recommending command line flags in controller-runtime. That's maybe more kubebuilder territory.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@godwinpang Can you please update the proposal according to the implementation so we can merge this one here as well?
/remove-lifecycle stale
## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre filled queue before it starts processing items.
The pre-filled queue can conflict with the priority queue feature. When the priority queue is enabled, the queue of a controller that has just restarted will initially be filled with low-priority items. However, while running as a non-leader, events for those items will reset their priority to 0. As a result, once this instance becomes the leader, the benefits of the priority queue are diminished and may even completely disappear.
> However, while running as a non-leader, events for those items will reset their priority to 0
I assume because these objects are getting updated sooner or later and with a regular update event the priority will be increased to 0?
Yes, that's right.
cc @alvaroaleman (I think that is expected behavior as of now, but still worth thinking about)
It is a good point and I agree this is a problem, but it is not clear to me how we could solve it. Have a wrapper around the priority queue that reduces the priority of all adds to lowPriority until the current replica is leader?
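Roughly what such a wrapper could look like, assuming the priority queue keeps its current `AddWithOpts`/`AddOpts` surface; the `demotingQueue` type, the `lowPriority` constant, and the `isLeader` hook are all hypothetical:

```go
package warmqueue // package name is illustrative

import (
	"sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue"
)

// lowPriority mirrors the priority used for initial-list events; the exact
// value/constant to reuse would need to be confirmed against controller-runtime.
const lowPriority = -100

// demotingQueue forces every add to low priority until this replica has won
// leader election, so pre-leader event churn doesn't erase the priority signal.
type demotingQueue[T comparable] struct {
	priorityqueue.PriorityQueue[T]
	isLeader func() bool // hypothetical hook, e.g. backed by mgr.Elected()
}

func (q *demotingQueue[T]) AddWithOpts(o priorityqueue.AddOpts, items ...T) {
	if !q.isLeader() {
		o.Priority = lowPriority
	}
	q.PriorityQueue.AddWithOpts(o, items...)
}
```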
Yeah maybe something like that. Or change all items in all queues to low priority when a replica becomes leader.
Your option is probably less racy.
I guess in general we would like to overwrite the priority with low priority even if a custom handler wanted to set a different priority?
This change describes the motivation and implementation details for supporting warm replicas in controller-runtime. I have floated this idea offline with @alvaroaleman to address some really slow sources that we work with that take tens of minutes to serve the initial list. There is no open issue discussing it. Let me know if that is preferred and I can open one.
Previously discussed in #2005 and #2600