📖 Add a design for supporting warm replicas. #3121
Conversation
## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre filled queue before it starts processing items.
As discussed offline, one way to avoid this is to use a metrics wrapper that suppresses them until the leader election is won. But not sure if it's worth bothering.
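For illustration, a rough sketch of what suppressing the depth gauge until leadership could look like; the `gatedDepth` type and its `promote` hook are hypothetical, only `workqueue.GaugeMetric` is existing API:

```go
package warmup // package name is illustrative

import (
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

// gatedDepth wraps the real depth gauge and suppresses updates until the
// replica has won leader election; the accumulated depth is flushed on promotion.
type gatedDepth struct {
	delegate workqueue.GaugeMetric // the real Prometheus-backed gauge
	leading  atomic.Bool
	pending  atomic.Int64 // depth accumulated while not leading
}

func (g *gatedDepth) Inc() {
	if g.leading.Load() {
		g.delegate.Inc()
		return
	}
	g.pending.Add(1)
}

func (g *gatedDepth) Dec() {
	if g.leading.Load() {
		g.delegate.Dec()
		return
	}
	g.pending.Add(-1)
}

// promote is called once leader election is won and replays the suppressed depth.
func (g *gatedDepth) promote() {
	g.leading.Store(true)
	for n := g.pending.Swap(0); n > 0; n-- {
		g.delegate.Inc()
	}
}
```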
Does this actually break the metric? Sounds like the metric will just show the reality
It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.
> It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.
Not sure I quite agree with that. The alerts work today; if we change the behavior here, we break them. To my knowledge there also isn't a metric that indicates if a given replica is the leader, so I don't even see a good way to unbreak them.
Yeah, but the current definition of the metric is that it should show the length of the queue
> To my knowledge there also isn't a metric that indicates if a given replica is the leader
That could be solved, I hope :)
- logs via logState
- other workqueue metrics: adds_total, queue_duration_seconds
- Although I guess we can also fake these. What would happen when the controller starts? I assume we would set the length metric immediately to its correct value. Similar for adds_total and probably also queue_duration_seconds
I also think folks have programmatic access to the queue (at least by instantiating the queue in controller.Options.NewQueue)
So we don't know what kind of things folks are doing with the queue, e.g. accessing queue length or taking items out of the queue even if the leader election controllers are not running.
Then maybe we shouldn't store the items in the queue at this time, because that's observable behavior as well (not only through the metric), rather than just making it look like the queue is empty through the metric.
That makes sense to me. I guess, reading between the lines, the suggestion here is that non-leader runnables should not fill the queue. The specific behavior I want to solve for is to have the queue ready to go ASAP after leader election, so would we implement something like "populate queue from cache on winning leader election"? I am not too familiar with the details here so would appreciate some concrete nudges in the right direction.
> That makes sense to me. I guess, reading between the lines, the suggestion here is that non-leader runnables should not fill the queue. The specific behavior I want to solve for is to have the queue ready to go ASAP after leader election, so would we implement something like "populate queue from cache on winning leader election"?
I don't know how that would be possible, we would need to implement a dummy queue to buffer.
Let's punt on this problem until we make this the default behavior; we don't have to solve it right away. As long as it's opt-in and there is a metric indicating whether a given replica is the leader, it won't break anyone by default, and people can adjust their alerts accordingly.
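For reference, a minimal sketch of how such a leader metric could be exposed from user code today, using the existing `mgr.Elected()` channel and `metrics.Registry`; the metric name and helper function are made up for illustration:

```go
package main

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// leaderGauge is 1 while this replica holds the leader lease, 0 otherwise.
// The metric name is illustrative, not an established convention.
var leaderGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "controller_runtime_is_leader",
	Help: "1 if this replica currently holds the leader lease.",
})

func registerLeaderMetric(mgr manager.Manager) error {
	if err := metrics.Registry.Register(leaderGauge); err != nil {
		return err
	}
	// mgr.Elected() is closed once this replica wins leader election.
	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		select {
		case <-mgr.Elected():
			leaderGauge.Set(1)
		case <-ctx.Done():
		}
		<-ctx.Done()
		return nil
	}))
}
```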
Sounds good!
Thought more about this. Probably the initial suggestion is fine. We should just make sure that the metrics popping up once the controller becomes leader make sense
## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre filled queue before it starts processing items.
2. Ideally, non-leader runnables should block readyz and healthz checks until they are in sync. I am
This will break conversion webhooks. I don't know if there is a good way to figure out if the binary contains a conversion webhook, but if in doubt we have to retain the current behavior
Can you explain why this breaks conversion webhooks? Taking a step back, why does controller-runtime offer a way to run conversion webhooks in the first place?
> Can you explain why this breaks conversion webhooks?
Because they need to be up to sync the cache, so blocking readyz until the cache is ready creates a deadlock.
> Taking a step back, why does controller-runtime offer a way to run conversion webhooks in the first place?
Because they are part of some controllers.
Got it, we can just put the burden on the user to register readyz as they want then.
Potentially we can provide a util func like we did with mgr.GetWebhookServer().StartedChecker() to make it easier (can be done in a follow-up PR).
But agree we can't add this by default as it can lead to deadlocks for controllers that serve conversion webhooks.
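For illustration, such a helper might look roughly like this; `cacheSyncedChecker` is a hypothetical name, while `healthz.Checker`, `manager.Manager.GetCache()`, and `WaitForCacheSync` are existing API:

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/healthz"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// cacheSyncedChecker reports ready only once the manager's informer caches
// have synced. It must NOT be registered by controllers that also serve
// conversion webhooks, since the cache may need the webhook to sync (deadlock).
func cacheSyncedChecker(mgr manager.Manager) healthz.Checker {
	return func(_ *http.Request) error {
		// Probe with a short timeout so the readiness endpoint stays responsive.
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		if !mgr.GetCache().WaitForCacheSync(ctx) {
			return errors.New("informer caches have not synced yet")
		}
		return nil
	}
}

// Usage: mgr.AddReadyzCheck("cache-sync", cacheSyncedChecker(mgr))
```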
At least in #3192, this has not yet been fully implemented.
`cm.runnables.Warmup.Start()` launches a goroutine for each runnable to invoke `runnable.Start()` and then returns once all known runnables have been started. After that, the leaderElector is started. In other words, at present, there is no guaranteed ordering between leader election and the completion of cache synchronization.
The key point is that passing in the `ready` parameter is not currently implemented: `if err := r.Warmup.Add(RunnableFunc(warmupRunnable.Warmup), nil); err != nil {`. If the check only passes after the warmup is completed, and before that it blocks `cm.runnables.Warmup.Start()`, then the idea of blocking leader election would be feasible.
controller-runtime/pkg/manager/runnable_group.go, lines 246 to 250 in 9d3997b:

```go
if rn.Check(r.ctx) {
	if rn.signalReady {
		r.startReadyCh <- rn
	}
}
```
controller-runtime/pkg/manager/runnable_group.go, lines 190 to 210 in 9d3997b:

```go
// Wait for all runnables to signal.
for {
	select {
	case <-ctx.Done():
		if err := ctx.Err(); !errors.Is(err, context.Canceled) {
			retErr = err
		}
	case rn := <-r.startReadyCh:
		for i, existing := range r.startQueue {
			if existing == rn {
				// Remove the item from the start queue.
				r.startQueue = append(r.startQueue[:i], r.startQueue[i+1:]...)
				break
			}
		}
		// We're done waiting if the queue is empty, return.
		if len(r.startQueue) == 0 {
			return
		}
	}
}
```
Got it, I think you're right
> After that, the leaderElector is started. In other words, at present, there is no guaranteed ordering between leader election and the completion of cache synchronization.

Yes, this is correct, but I think your suggestion of a readiness check means that we won't even start non-leader-election runnables before the warmups have finished, which I think is undesirable. Not starting leader election until it has finished seems fine, but might slow down things in case there was no passive replica, because acquiring the leader lease takes a bit of time.
The other thing this makes me realize is that in order to have rollouts where the cache sync is not part of the downtime we would need to integrate the warmup with a healthcheck, such that the healthcheck only passes once warmup has completed, otherwise what could happen is:
- new pod n1 gets created
- Old pod o1 gets removed and gives up the leader lease
- leader lease gets acquired by old pod o2, as n1 hasn't finished syncing its cache
- new pod n2 gets created
- old pod o2 gets shut down; if neither n1 nor n2 has synced yet, we will again have some cache-sync-induced downtime
The issue then however is that delaying the healthcheck until syncing finished poses a problem for anyone who uses conversion webhooks, as conversion webhooks may be required to be able to sync, causing a deadlock.
We could probably add a second knob to control if the warmup should be part of the healthcheck and then this can be used by anyone who doesn't have conversion webhooks?
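Purely to illustrate the idea, the two knobs could be shaped roughly like this (field names are made up, not an agreed API):

```go
// Sketch only: two independent opt-ins, so that controllers serving
// conversion webhooks can warm up without wiring warmup into readiness.
type WarmupOptions struct {
	// WarmupWithoutLeaderElection starts sources/caches before the leader
	// lease is acquired.
	WarmupWithoutLeaderElection *bool

	// BlockReadyzUntilWarmedUp makes readyz fail until warmup has completed.
	// Must stay false/nil for binaries that also serve conversion webhooks,
	// because the cache may need the webhook to be ready (deadlock).
	BlockReadyzUntilWarmedUp *bool
}
```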
> Not starting leader election until it has finished seems fine, but might slow down things in case there was no passive replica, because acquiring the leader lease takes a bit of time.
I’m describing this requirement because, in our scenario, we need to strictly control downtime, and synchronizing the cache usually takes several minutes. Therefore, a delay of just a few seconds caused by leader election is entirely acceptable.
> leader lease gets acquired by old pod o2, as n1 hasn't finished syncing its cache
It still seems to assume that n1 will only attempt to acquire the lease after it has finished syncing the cache.
Warmup, healthcheck, and leader election need to be combined together in order to eliminate downtime during rollouts. If only warmup and healthcheck are used, the following situation can occur:
- new pod n1 is created, completes cache synchronization, and becomes ready.
- old pod o1 is removed and gives up the leader lease.
- new pod n2 is created, its cache has not yet finished syncing, and it starts to participate in leader election.
- the new pod n2 becomes the leader, but its cache has not yet finished syncing, leading to downtime.
2. Ideally, non-leader runnables should block readyz and healthz checks until they are in sync. I am
not sure what the best way of implementing this is, because we would have to add a healthz check
that blocks on WaitForSync for all the sources started as part of the non-leader runnables.
3. An alternative way of implementing the above is to moving the source starting / management code
That is kind of already the case: the source uses the cache (which is a runnable) to get the actual informer, and the cache is shared and started before anything else except conversion webhooks (as conversion webhooks might be needed to start it). The problem is just that the cache does not actually cache anything for resources for which no informer was requested, and that in turn only happens after the source is started, which happens post controller start and thus post leader election.
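As an aside, this is also why explicitly requesting an informer up front warms it before leader election today; a minimal sketch under that assumption, using the existing `cache.Cache.GetInformer` API:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// warmPodInformer asks the shared cache for the Pod informer up front. The
// cache is a runnable that starts regardless of leadership, so the informer is
// created and synced while the replica is still a non-leader.
func warmPodInformer(ctx context.Context, mgr manager.Manager) error {
	_, err := mgr.GetCache().GetInformer(ctx, &corev1.Pod{})
	return err
}
```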
designs/warmreplicas.md (outdated)
## Motivation
Controllers reconcile all objects during startup / leader election failover to account for changes
in the reconciliation logic. For certain sources, the time to serve the initial list can be
Just curious. Is it literally the list call that takes minutes or the subsequent reconciliation?
Does this change when the new ListWatch feature is used?
This is purely about the time it takes to start reconciling after a replica went down because of a rollout or an unforeseen event. Right now, that means we first acquire the leader lease, then sync all caches, then start reconciling. The goal of this doc is to do the second step before we even try to acquire the leader lease, as that takes the cache-sync time out of the transition time.
Agree the description could be a bit clearer.
Yeah that's fine. I was just curious about the list call :) (aka the cache sync)
downtime as even after leader election, the controller has to wait for the initial list to be served
before it can start reconciling.

## Proposal
What if there's a lot of churn and the controller doesn't become leader maybe not at all or maybe after a few days?
The queue length will increase while there is nothing that takes items out of the queue.
I know the queue doesn't require significant memory to store an item, but is there something we should be concerned about if the queue has e.g. millions of items (let's say we watch pods and we don't become leader for a month)?
Worst case, it at some point gets OOM-killed, restarts, and does everything again. I don't think this is likely to become an actual issue.
Yeah definitely not a big problem. Maybe just good to know
Doesn't the queue have deduplication built in? I thought that you couldn't re-add a key to the queue if the key already existed within the queue.
In which case, when you perform the initial list, it adds every item to the queue, and then every object created after would be added to the queue, but the queue length would always be equal to the number of those objects in the cluster
We should double check if this is actually how they work
> Doesn't the queue have deduplication built in? I thought that you couldn't re-add a key to the queue if the key already existed within the queue.
Yes there is de-duplication. We are basically talking about a map :)
> queue length would always be equal to the number of those objects in the cluster
I don't think that deleted objects are removed from our queue. So if you have churn (i.e. any objects are deleted) the queue length won't be equal to the number of objects in the cluster.
But the queue only stores namespace + name. So it doesn't require a lot of memory per object.
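To make the de-duplication concrete, a small standalone sketch using the typed workqueue that controller-runtime builds on (assuming the typed constructors from client-go are available in your dependency versions):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
	q := workqueue.NewTypedRateLimitingQueue(
		workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]())
	req := reconcile.Request{NamespacedName: types.NamespacedName{Namespace: "default", Name: "foo"}}

	q.Add(req)
	q.Add(req) // de-duplicated: the queue behaves like a set keyed by namespace/name

	fmt.Println(q.Len()) // 1
}
```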
## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre filled queue before it starts processing items.
> Does this actually break the metric? Sounds like the metric will just show the reality
> It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.
Took another look. Definitely fine for now. I would suggest we review/merge #3192 and then finalize this PR afterwards.
Do you have a POC or any implementation that would show what the difference is for an end user? Sources generally are informers, right? So, as someone implementing a controller, how would I start the informers in this case vs what I would have normally done?
A worked example of the usage would probably be good to include in the design doc
// when the manager is not the leader.
// Defaults to false, which means that the controller will wait for leader election to start
// before starting sources.
ShouldWarmupWithoutLeadership *bool
Why `*bool` over `bool`?
We usually like to use *bool so we can differentiate between default value and whatever a user configured.
This is also necessary as we have two levels of configuration ("manager" and per controller) and we have to be able to figure out if the per-controller config overrides the manager config
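A tiny sketch of the resolution logic this enables; the helper name is hypothetical:

```go
// resolveWarmup picks the per-controller override when set, otherwise the
// manager-wide default; a nil override means "not configured".
func resolveWarmup(managerDefault bool, controllerOverride *bool) bool {
	if controllerOverride != nil {
		return *controllerOverride
	}
	return managerDefault
}
```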
Has the impact on API servers been assessed for this proposal? IIUC, the slow-to-start sources are likely informers that are having to page through large volumes of resources, right? So there's already a higher-than-average load on the API server for these types of resources, and this EP is basically going to double the API server load? Am I following this correctly?
It's an opt-in feature, and in the vast majority of cases the kube-apiserver is managed. The problem this solves is that failover is currently slow due to the fact that the watches are only started after leader election is won.
Maybe additional apiserver load is worth mentioning in the godoc of the field that enables the feature?
The implementation can be found here: #3192. You only have to set I would recommend reviewing the PR as we are figuring out some details there at the moment (e.g. I raised the question whether WarmupRunnable should be exported, xref: #3192 (comment); this would drastically reduce the API surface of this feature).
Opt-in for the operator author, or opt-in for the end user who deploys that operator to their cluster? I suspect we might want to recommend that folks expose an option to end users to be able to turn this on/off on a cluster-by-cluster basis. I guess that would be exposing the
That is up to the operator author to decide. I personally don't think the duplicate watches are an issue for the apiserver; if that can make it fall over, it's too small anyway. But generally speaking, making a call as to what to do or not to do by default, and whether an end-user option should be exposed, is up to the controller author. I don't believe controller-runtime is in a good position to make a recommendation there.
Usually we don't go as far as recommending command line flags in controller-runtime. That's maybe more kubebuilder territory.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@godwinpang Can you please update the proposal according to the implementation so we can merge this one here as well?
/remove-lifecycle stale
## Concerns/Questions
1. Controllers opted into this feature will break the workqueue.depth metric as the controller will
have a pre filled queue before it starts processing items.
The pre-filled queue can conflict with the priority queue feature. When the priority queue is enabled, the queue of a controller that has just restarted will initially be filled with low-priority items. However, while running as a non-leader, events for those items will reset their priority to 0. As a result, once this instance becomes the leader, the benefits of the priority queue are diminished and may even completely disappear.
> However, while running as a non-leader, events for those items will reset their priority to 0
I assume because these objects are getting updated sooner or later and with a regular update event the priority will be increased to 0?
Yes, that's right.
cc @alvaroaleman (I think that is expected behavior as of now, but still worth thinking about)
It is a good point and I agree this is a problem, but it is not clear to me how we could solve it. Have a wrapper around the priority queue that reduces the priority of all adds to lowPriority until the current replica is leader?
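Roughly what such a wrapper could look like, assuming the priority queue keeps its current `AddWithOpts`/`AddOpts` surface; the `demotingQueue` type, the `lowPriority` constant, and the `isLeader` hook are all hypothetical:

```go
package warmqueue // package name is illustrative

import (
	"sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue"
)

// lowPriority mirrors the priority used for initial-list events; the exact
// value/constant to reuse would need to be confirmed against controller-runtime.
const lowPriority = -100

// demotingQueue forces every add to low priority until this replica has won
// leader election, so pre-leader event churn doesn't erase the priority signal.
type demotingQueue[T comparable] struct {
	priorityqueue.PriorityQueue[T]
	isLeader func() bool // hypothetical hook, e.g. backed by mgr.Elected()
}

func (q *demotingQueue[T]) AddWithOpts(o priorityqueue.AddOpts, items ...T) {
	if !q.isLeader() {
		o.Priority = lowPriority
	}
	q.PriorityQueue.AddWithOpts(o, items...)
}
```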
Yeah maybe something like that. Or change all items in all queues to low priority when a replica becomes leader.
Your option is probably less racy.
I guess in general we would like to overwrite the priority with low priority even if a custom handler wanted to set a different priority?
This change describes the motivation and implementation details for supporting warm replicas in controller-runtime. I have floated this idea offline with @alvaroaleman to address some really slow sources that we work with that take tens of minutes to serve the initial list. There is no open issue discussing it. Let me know if that is preferred and I can open one.
Previously discussed in #2005 and #2600