Conversation

hongkunxu (Contributor):

Why I'm doing:

Under the current rebalance + deletion mechanism, when a new BE node is added, data cannot be successfully migrated to it.

Here’s why:

  • During rebalance, a tablet replica is cloned from a source BE to the new BE.
  • Until the source replica is cleaned up, the source BE still holds it, so the tablet temporarily has more replicas than required and the system immediately treats the situation as redundant.
  • The source BE's replica is already recorded in cachedReplicaId and scheduled for deletion, but the redundancy check does not know this, so the cloned replica on the new BE is also incorrectly judged as redundant and removed.

As a result, the new BE node never keeps the migrated data, making the rebalance ineffective.
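
To make the failure concrete, here is a minimal Java sketch of the pre-fix decision, with hypothetical Replica/Tablet shapes; only the decision logic mirrors the description above:

import java.util.List;

// Hypothetical shapes; the real scheduler code is far more involved.
class PreFixRedundancyCheck {
    record Replica(long id, long backendId) {}

    static final int REPLICATION_NUM = 3;

    // Right after the clone, the tablet briefly has REPLICATION_NUM + 1
    // replicas. This check knows nothing about cachedReplicaId, so the
    // replica it picks for deletion can be the freshly cloned one on the
    // new BE, which undoes the migration.
    static Replica pickRedundantReplica(List<Replica> replicas) {
        if (replicas.size() > REPLICATION_NUM) {
            return replicas.get(replicas.size() - 1); // may be the new clone
        }
        return null;
    }
}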

What I'm doing:

I added an extra check to the redundant replica detection logic (a sketch follows the list):

  1. Before deciding whether a replica is redundant, we now verify whether its tabletId already exists in cachedReplicaId.
  2. If it does, we skip the redundant-replica judgment, because the source replica is already scheduled for cleanup by the rebalancer.
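
A minimal sketch of the added guard; class and method names here are illustrative, and only cachedReplicaId and the tabletId key come from this PR:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the extra check; names other than cachedReplicaId are illustrative.
class RedundantReplicaChecker {
    // tabletId -> replicaId of the source replica the rebalancer will delete
    private final Map<Long, Long> cachedReplicaId = new ConcurrentHashMap<>();

    boolean shouldSkipRedundantJudgment(long tabletId) {
        // If the tablet is already tracked here, its source replica is
        // scheduled for cleanup after the clone succeeds, so we skip the
        // generic redundancy judgment and let the cloned replica survive.
        return cachedReplicaId.containsKey(tabletId);
    }
}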

Fixes #62541

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport PR

Bugfix cherry-pick branch check:

  • I have checked the version labels for the target branches this PR will be auto-backported to
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Signed-off-by: Hongkun Xu <xuhongkun666@163.com>
hongkunxu requested a review from a team as a code owner, August 31, 2025 07:31

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

gengjun-git self-assigned this, Sep 1, 2025
@@ -1185,7 +1185,13 @@ private boolean deleteLocationMismatchReplica(TabletSchedCtx tabletCtx, boolean
        if (backend != null) {
            Pair<String, String> location = backend.getSingleLevelLocationKV();
            if (location != null && matchedLocations.contains(location)) {
                dupReplica = replica;
wyb (Contributor) commented on this line:

How about iterating the replicas backwards and removing the replica with the duplicate location?
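
For concreteness, the shape such a backwards scan could take (illustrative only; the real code works on tablet replicas and matchedLocations):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Iterate from the end so removal by index stays safe; drop any replica
// whose location has already been seen.
class BackwardDedup {
    record Replica(long id, String location) {}

    static void removeDuplicateLocations(List<Replica> replicas) {
        Set<String> seen = new HashSet<>();
        for (int i = replicas.size() - 1; i >= 0; i--) {
            if (!seen.add(replicas.get(i).location())) {
                replicas.remove(i);
            }
        }
    }
}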

hongkunxu (Contributor, Author):

Hi @wyb, iterating the replicas backwards and removing the replica with the duplicate location is indeed one solution, but wouldn't that mean that

// tabletId -> replicaId
// used to delete src replica after copy task success
private final Map<Long, Long> cachedReplicaId = new ConcurrentHashMap<>();

becomes useless? This put operation only happens in completeSchedCtx, inside createBalanceTask. And since rebalance inevitably produces duplicate replicas, if we clean them up directly here, doesn't that mean the subsequent deleteReplicaChosenByRebalancer could also be removed?
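
For context, a simplified sketch of the lifecycle being discussed; completeSchedCtx, createBalanceTask, and deleteReplicaChosenByRebalancer are named above, while the method bodies here are reduced, illustrative bookkeeping:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RebalancerSketch {
    // tabletId -> replicaId, used to delete the src replica after the copy succeeds
    private final Map<Long, Long> cachedReplicaId = new ConcurrentHashMap<>();

    // Written once, when the balance (clone) task is created
    // (in the real code: completeSchedCtx inside createBalanceTask).
    void onBalanceTaskCreated(long tabletId, long srcReplicaId) {
        cachedReplicaId.put(tabletId, srcReplicaId);
    }

    // Consumed later: remove the recorded source replica, then drop the entry.
    void deleteReplicaChosenByRebalancer(long tabletId) {
        Long srcReplicaId = cachedReplicaId.remove(tabletId);
        if (srcReplicaId != null) {
            deleteReplica(tabletId, srcReplicaId);
        }
    }

    private void deleteReplica(long tabletId, long replicaId) {
        // placeholder for the actual replica-deletion metadata update
    }
}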

wyb (Contributor):

The current implementation doesn't guarantee that the source replica will be deleted via deleteReplicaChosenByRebalancer after balancing.

hongkunxu requested a review from wyb, September 1, 2025 06:15

Successfully merging this pull request may close these issues.

[Bug] New BE added to an existing rack cannot get balanced data