-
Notifications
You must be signed in to change notification settings - Fork 3.6k
feat: Adds detection for orphaned TSM files while planning compaction #26766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master-1.x
Are you sure you want to change the base?
Conversation
We were seeing a state in which a shard would not perform full compactions leading to a build up of level 4 TSM files. File state: ``` -rw-r--r--. 1 root root 2.1G Aug 5 20:59 000016684-000000007.tsm -rw-r--r--. 1 root root 2.1G Aug 5 21:02 000016684-000000008.tsm -rw-r--r--. 1 root root 2.1G Aug 5 21:04 000016684-000000009.tsm -rw-r--r--. 1 root root 376M Aug 5 21:05 000016684-000000010.tsm -rw-r--r--. 1 root root 2.1G Aug 5 18:00 000016812-000000004.tsm -rw-r--r--. 1 root root 1.4G Aug 5 18:00 000016812-000000005.tsm -rw-r--r--. 1 root root 1.3G Aug 5 21:21 000016844-000000002.tsm -rw-r--r--. 1 root root 2.1G Aug 5 18:00 000016948-000000004.tsm -rw-r--r--. 1 root root 1.4G Aug 5 18:00 000016948-000000005.tsm -rw-r--r--. 1 root root 2.1G Aug 5 18:00 000017076-000000004.tsm ``` There is a rouge level 2 file packed within fully compacted files ``` -rw-r--r--. 1 root root 2.1G Aug 5 20:59 000016684-000000007.tsm -rw-r--r--. 1 root root 2.1G Aug 5 21:02 000016684-000000008.tsm -rw-r--r--. 1 root root 2.1G Aug 5 21:04 000016684-000000009.tsm -rw-r--r--. 1 root root 376M Aug 5 21:05 000016684-000000010.tsm ``` and level 4 files ``` -rw-r--r--. 1 root root 2.1G Aug 5 18:00 000016948-000000004.tsm -rw-r--r--. 1 root root 1.4G Aug 5 18:00 000016948-000000005.tsm -rw-r--r--. 1 root root 2.1G Aug 5 18:00 000017076-000000004.tsm ``` The area of our code that would cause this state to be skipped would be here https://github.com/influxdata/influxdb/blob/22bec4f046a28e3f1fa815705362767151407e1b/tsdb/engine/tsm1/compact.go#L620-L670 We need to add some sort of escape mechanism that would allow for compactions to occur or simplify this logic. This PR adds a test for the bugged file listing where we experience the issue. I'm still trying to find the way files get in to this state, but, this should resolve the issue if and when it occurs. I've added the check in PlanOptimize() as to not change any of the core logic in Plan().
The test fails without the second condition check? |
Correct, I should have pushed up a commit with just the test but if you remove the check it fails. |
Great!
…On Fri, Aug 29, 2025, 6:50 PM WeblWabl ***@***.***> wrote:
*devanbenz* left a comment (influxdata/influxdb#26766)
<#26766 (comment)>
The test fails without the second condition check? || cur.Len() < 2
Correct, I should have pushed up a commit with just the test but if you
remove the check it fails.
—
Reply to this email directly, view it on GitHub
<#26766 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ARIQHJAA3H5YUAJXYFTWWMT3QD7PJAVCNFSM6AAAAACFF6HDZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTEMZYHA2DSNBXGM>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
* 2 level 2 files * 3 level 2 files * 4 level 2 files * 5 level 2 files
…k at the files on disk a lot of these l4 files are not at the max points per block, just at the max file size of 2.1 GB.
…we don't have nested lower level files in our plans
tsdb/engine/tsm1/compact.go
Outdated
@@ -640,7 +642,7 @@ func (c *DefaultPlanner) Plan(lastWrite time.Time) ([]CompactionGroup, int64) { | |||
|
|||
// Skip compacting this group if there happens to be any lower level files in the | |||
// middle. These will get picked up by the level compactors. | |||
if lvl <= 3 { | |||
if lvl <= 3 && len(gen.files) > 4 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it would be better to ensure that the files are nested within high level files. For example, lets say we have the following levels listed
5
6
7
8
4
5
2
2
4
5
4
5
We can assume that the 2 level 2 files need to be compacted as a group with our high level files.
But if we have
5
6
7
8
4
5
2
2
since the 2 level 2 files are at the end of the entire list of files we should not compact them with our higher groups
I found the following test case:
data := []tsm1.FileStat{
{
Path: "01-04.tsm1",
Size: 64 * 1024 * 1024,
},
{
Path: "02-04.tsm1",
Size: 64 * 1024 * 1024,
},
{
Path: "03-04.tsm1",
Size: 64 * 1024 * 1024,
},
{
Path: "04-04.tsm1",
Size: 64 * 1024 * 1024,
},
{
Path: "05-02.tsm1",
Size: 2 * 1024 * 1024,
},
}
Which is found here:
influxdb/tsdb/engine/tsm1/compact_test.go
Lines 2593 to 2635 in b647892
func TestDefaultPlanner_Plan_CompactsMiddleSteps(t *testing.T) { | |
data := []tsm1.FileStat{ | |
{ | |
Path: "01-04.tsm1", | |
Size: 64 * 1024 * 1024, | |
}, | |
{ | |
Path: "02-04.tsm1", | |
Size: 64 * 1024 * 1024, | |
}, | |
{ | |
Path: "03-04.tsm1", | |
Size: 64 * 1024 * 1024, | |
}, | |
{ | |
Path: "04-04.tsm1", | |
Size: 64 * 1024 * 1024, | |
}, | |
{ | |
Path: "05-02.tsm1", | |
Size: 2 * 1024 * 1024, | |
}, | |
} | |
cp := tsm1.NewDefaultPlanner( | |
newFakeFileStore(withFileStats(data)), | |
tsdb.DefaultCompactFullWriteColdDuration, | |
) | |
expFiles := []tsm1.FileStat{data[0], data[1], data[2], data[3]} | |
tsm, pLen := cp.Plan(time.Now()) | |
if exp, got := len(expFiles), len(tsm[0]); got != exp { | |
t.Fatalf("tsm file length mismatch: got %v, exp %v", got, exp) | |
} else if pLen != int64(len(tsm)) { | |
t.Fatalf("tsm file plan length mismatch: got %v, exp %v", pLen, exp) | |
} | |
for i, p := range expFiles { | |
if got, exp := tsm[0][i], p.Path; got != exp { | |
t.Fatalf("tsm file mismatch: got %v, exp %v", got, exp) | |
} | |
} | |
} |
With this change does not pass. I would expect for the first four files to get compacted but the last one to be left out since its at the end of the file list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts, @gwossum @davidby-influx?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not pick up the level 2 files at the end? They will get the benefits of the better compression from inclusion with higher level files. Are you concerned it will be extra work on a hot shard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retracting the above comment after discussion.
with higher level files if it is nested
@devanbenz - I think we need to create a new test case file. In it, we should have many, many permutations of compaction levels and generations. Here are a few examples:
We want to test
Create |
Claude generated the test cases and I modified the results as to what I would expect.
I've found the static test file to work better this AM while working on this. I'm going to add the same tests duplicated but using different block counts to see how it impacts the planner + expected results. |
complex_nested_4_5_2_2_1_2_4_5_different_gens complex_nested_4_5_2_2_1_2_4_5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am slightly confused by the changes. It may be my problem.
tsdb/engine/tsm1/compact.go
Outdated
@@ -640,7 +653,7 @@ func (c *DefaultPlanner) Plan(lastWrite time.Time) ([]CompactionGroup, int64) { | |||
|
|||
// Skip compacting this group if there happens to be any lower level files in the | |||
// middle. These will get picked up by the level compactors. | |||
if lvl <= 3 { | |||
if lvl <= 3 && len(gen.files) > 4 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the length check here. Why are big groups of lower level files skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They will be picked up by level compaction as show by a few of the tests, for example:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor perf comment and one comment that needs to be addressed (or I'm confused) -- Either there should be a test that exercises the change or the condition should be removed (I think it can be removed).
tsdb/engine/tsm1/compact.go
Outdated
for i, g := range generations { | ||
if g.level() <= 3 { | ||
// Track the last high-level generation | ||
if g.level() > 3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
level()
calls parsefilename -- I think we should store off the lvl since you're calling it twice now and we know we spend a lot of cycles on parsing tsm file names (to be improved more substantially in other work).
lvl := g.level()
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
tsdb/engine/tsm1/compact.go
Outdated
if lvl <= 3 { | ||
// Skip compacting this group if there happens to be more than 3 lower level | ||
// files in the middle. These will get picked up by the level compactors. | ||
if lvl <= 3 && len(gen.files) > 3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this check. The loop over generations
(line 585) breaks when a generation of level 3 or less with 4 or more files is encountered so the remaining generations in consideration for compaction will never contain a group that matches lvl <= 3 && len(gen.files) > 3
as that's the same as lvl <= 3 && len(g.files) >= 4
above.
The tests in compact_test.go
with coverage show that the inner part of this if isn't ever run. So the code is confusing now or at least just as confusing as before; as the current master-1.x
code can not hit the inner part of this if block either as the loop over generations broke when any lvl <= 3
was hit.
I recommend adding a test that proves this if-chunk is necessary or remove it.
As an aside, the if check here and the comment imply to me that this compaction code used to be resilient against "orphaned" low level tsm files surrounded by fully compacted files, but looking at the git history that doesn't appear to be the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, another test to hit this conditional would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, commenting out the check on line 655 does cause test failures:
// Skip compacting this group if there happens to be more than 3 lower level
// files in the middle. These will get picked up by the level compactors.
if lvl <= 3 { // && len(gen.files) > 3 {
skipGroup = true
=== RUN TestEnginePlanCompactions/nested_4_5_2_2_4_5
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/nested_4_5_2_2_4_5
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/nested_4_5_2_2_4_5 (0.00s)
=== RUN TestEnginePlanCompactions/complex_nested_4_5_2_4_5_2_4
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/complex_nested_4_5_2_4_5_2_4
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/complex_nested_4_5_2_4_5_2_4 (0.00s)
=== RUN TestEnginePlanCompactions/complex_nested_4_5_2_2_1_2_4_5
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/complex_nested_4_5_2_2_1_2_4_5
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/complex_nested_4_5_2_2_1_2_4_5 (0.00s)
~/src/github.com/influxdata/influxdb/tsdb/engine/tsm1$ git diff
diff --git a/tsdb/engine/tsm1/compact.go b/tsdb/engine/tsm1/compact.go
index a7c318c54f..8de97a3fc5 100644
--- a/tsdb/engine/tsm1/compact.go
+++ b/tsdb/engine/tsm1/compact.go
@@ -652,7 +652,7 @@ func (c *DefaultPlanner) Plan(lastWrite time.Time) ([]CompactionGroup, int64) {
// Skip compacting this group if there happens to be more than 3 lower level
// files in the middle. These will get picked up by the level compactors.
- if lvl <= 3 && len(gen.files) > 3 {
+ if lvl <= 3 { // && len(gen.files) > 3 {
skipGroup = true
break
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And commenting out the other check on line 592 also causes errors:
// Skip low-level generations with too many files for level compaction
if g.level() <= 3 { // && len(g.files) >= 4 {
break
}
}
=== RUN TestEnginePlanCompactions/Mixed_generations_with_varying_file_sizes
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/Mixed_generations_with_varying_file_sizes
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/Mixed_generations_with_varying_file_sizes (0.00s)
=== RUN TestEnginePlanCompactions/Mixed_generations_with_2_level_2_files
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/Mixed_generations_with_2_level_2_files
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/Mixed_generations_with_2_level_2_files (0.00s)
=== RUN TestEnginePlanCompactions/Mixed_generations_with_3_level_2_files
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/Mixed_generations_with_3_level_2_files
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/Mixed_generations_with_3_level_2_files (0.00s)
=== RUN TestEnginePlanCompactions/Mixed_generations_with_4_level_2_files
--- PASS: TestEnginePlanCompactions/Mixed_generations_with_4_level_2_files (0.00s)
=== RUN TestEnginePlanCompactions/Mixed_generations_with_5_level_2_files
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/Mixed_generations_with_5_level_2_files
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/Mixed_generations_with_5_level_2_files (0.00s)
=== RUN TestEnginePlanCompactions/First_file_is_lower_level_than_next_files
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/First_file_is_lower_level_than_next_files
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/First_file_is_lower_level_than_next_files (0.00s)
=== RUN TestEnginePlanCompactions/basic_high_level_4_5_4_5_4
--- PASS: TestEnginePlanCompactions/basic_high_level_4_5_4_5_4 (0.00s)
=== RUN TestEnginePlanCompactions/leading_low_2_4_5_4_5_4
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/leading_low_2_4_5_4_5_4
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/leading_low_2_4_5_4_5_4 (0.00s)
=== RUN TestEnginePlanCompactions/leading_low_run_2_2_4_5_4_5_4
compact_test.go:4528:
Error Trace: /home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4528
/home/davidby/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact_test.go:4515
Error: "[]" should have 1 item(s), but has 0
Test: TestEnginePlanCompactions/leading_low_run_2_2_4_5_4_5_4
Messages: unexpected level 4 Group collection length mismatch
--- FAIL: TestEnginePlanCompactions/leading_low_run_2_2_4_5_4_5_4 (0.00s)
~/src/github.com/influxdata/influxdb/tsdb/engine/tsm1$ git diff
diff --git a/tsdb/engine/tsm1/compact.go b/tsdb/engine/tsm1/compact.go
index a7c318c54f..5369924181 100644
--- a/tsdb/engine/tsm1/compact.go
+++ b/tsdb/engine/tsm1/compact.go
@@ -589,7 +589,7 @@ func (c *DefaultPlanner) Plan(lastWrite time.Time) ([]CompactionGroup, int64) {
}
// Skip low-level generations with too many files for level compaction
- if g.level() <= 3 && len(g.files) >= 4 {
+ if g.level() <= 3 { // && len(g.files) >= 4 {
break
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you only commented out half of the conditional. I'm saying the full conditional is never true in any test.
I get no test failures when I comment the 655 conditional:
diff --git a/tsdb/engine/tsm1/compact.go b/tsdb/engine/tsm1/compact.go
index a7c318c54f..2a0c02c6c0 100644
--- a/tsdb/engine/tsm1/compact.go
+++ b/tsdb/engine/tsm1/compact.go
@@ -648,14 +648,14 @@ func (c *DefaultPlanner) Plan(lastWrite time.Time) ([]CompactionGroup, int64) {
for j := i; j < i+step && j < len(generations); j++ {
gen := generations[j]
- lvl := gen.level()
+ //lvl := gen.level()
// Skip compacting this group if there happens to be more than 3 lower level
// files in the middle. These will get picked up by the level compactors.
- if lvl <= 3 && len(gen.files) > 3 {
- skipGroup = true
- break
- }
+ //if lvl <= 3 && len(gen.files) > 3 {
+ // skipGroup = true
+ // break
+ //}
// Skip the file if it's over the max size and it contains a full block
if gen.size() >= uint64(tsdb.MaxTSMFileSize) && gen.files[0].FirstBlockCount >= tsdb.DefaultMaxPointsPerBlock && !gen.hasTombstones() {
and to prove the point, if I instead comment out the conditional at 592 I also get no test failures.
diff --git a/tsdb/engine/tsm1/compact.go b/tsdb/engine/tsm1/compact.go
index a7c318c54f..d89b4c2cee 100644
--- a/tsdb/engine/tsm1/compact.go
+++ b/tsdb/engine/tsm1/compact.go
@@ -589,9 +589,9 @@ func (c *DefaultPlanner) Plan(lastWrite time.Time) ([]CompactionGroup, int64) {
}
// Skip low-level generations with too many files for level compaction
- if g.level() <= 3 && len(g.files) >= 4 {
- break
- }
+ //if g.level() <= 3 && len(g.files) >= 4 {
+ // break
+ //}
}
// If we have high-level files, only include generations up to the last high-level file
And way more interesting, if I comment out BOTH conditionals, I still get no test failures.
I expected the first two because I think they cover the same situations but removing both and not seeing any test failures means a gap in the testing of the changes in the pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@philjb It looks like by commenting out
diff --git a/tsdb/engine/tsm1/compact.go b/tsdb/engine/tsm1/compact.go
index a7c318c54f..d89b4c2cee 100644
--- a/tsdb/engine/tsm1/compact.go
+++ b/tsdb/engine/tsm1/compact.go
@@ -589,9 +589,9 @@ func (c *DefaultPlanner) Plan(lastWrite time.Time) ([]CompactionGroup, int64) {
}
// Skip low-level generations with too many files for level compaction
- if g.level() <= 3 && len(g.files) >= 4 {
- break
- }
+ //if g.level() <= 3 && len(g.files) >= 4 {
+ // break
+ //}
}
// If we have high-level files, only include generations up to the last high-level file
The tests still pass as it's a bit redundant since later on we will skip these groups anyway:
influxdb/tsdb/engine/tsm1/compact.go
Lines 637 to 653 in 27b8311
for j := i; j < i+step && j < len(generations); j++ { | |
gen := generations[j] | |
lvl := gen.level() | |
// Skip compacting this group if there happens to be any lower level files in the | |
// middle. These will get picked up by the level compactors. | |
if lvl <= 3 { | |
skipGroup = true | |
break | |
} | |
// Skip the file if it's over the max size and it contains a full block | |
if gen.size() >= uint64(tsdb.MaxTSMFileSize) && gen.files[0].FirstBlockCount >= tsdb.DefaultMaxPointsPerBlock && !gen.hasTombstones() { | |
startIndex++ | |
continue | |
} | |
} |
I wonder if this check can be removed to simplify 🤔
I'm tracing through the rest of the code to see if we can consolidate/simplify things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@philjb - I commented out the part of the conditional that was added in the PR, leaving the previous code. That's why it was half of each removed.
Related to commit: 06226d6 |
Adds tests to check for higher sized lower levels in compaction
}, | ||
}, | ||
{ | ||
name: "Mixed generations with varying file sizes, small amount of files", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test and all proceeding tests are for checking correctness in the edge case that Phil pointed out. I have this checking against a single lower level TSM file, 3 lower level TSM files (< 4) and 5 lower level TSM files (> 4). All tests appear to plan the nested lower level files for full compaction as expected.
} | ||
|
||
// If we have high-level files, only include generations up to the last high-level file | ||
if c.enableNestedCompactor && lastHighLevelIndex >= 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be needed since we set lastHighLevelIndex to -1 so if the feature flag is disabled it will just always be -1.. I want to remain consistent and indicate that this code is only supposed to be enabled when the feature flag is on so I decided to keep it.
start = i | ||
break | ||
if c.enableNestedCompactor { | ||
if g.size()*2 < generations[i-1].size() && generations[i-1].level() >= generations[i].level() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to add a comment about why this check needs to be different with the feature flag on. It's a bit confusing to just look at without knowing context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment would be great.
I still think that this logic will "leave behind" a non fully compacted file, but i think this is likely ok. For example
gen-lvl | size | index |
---|---|---|
6-6 | 6.1 | 0 |
5-5 | 5.1 | 1 |
4-4 | 4.1 | 2 |
3-2 | 2 | 3 |
2-2 | .9 | 4 |
1-7 | .9 | 5 |
0-1 | .9 | 6 |
start will be 4, end will be 6. The 3-2 file won't be picked up.
@@ -252,6 +252,7 @@ func NewEngine(id uint64, idx tsdb.Index, path string, walPath string, sfile *ts | |||
|
|||
var planner CompactionPlanner = NewDefaultPlanner(fs, time.Duration(opt.Config.CompactFullWriteColdDuration)) | |||
planner.SetAggressiveCompactionPointsPerBlock(int(opt.Config.AggressivePointsPerBlock)) | |||
planner.SetNestedCompactor(opt.Config.NestedCompactorEnabled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps log here if the config is set to true. I think that will be helpful confirmation. Sometimes we get logs but not a config file (although with Cloud1 we have direct access to both).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely log. Info level. Great suggestion.
"000016684-000000007.tsm", | ||
"000016812-000000004.tsm", | ||
"000016853-000000002.tsm", | ||
"000016948-000000004.tsm", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is wrong. This compaction group will merge these tsm files but that will produce incorrect overwrite semantics as the files in the group aren't adjacent.
The level 2 group looks right --- it pulled 4 level 2 files together into a group that are sequential. They will be compacted into a level 3 generation that will be correct for overwrites, producing a 000016844-3
But if the weighting of the scheduler picks this level4group
to be run, it'll compact 684-7, 812-5, 853-2, 948-4 together and produce a 000016684-8
which will be older than the one from the level 2 group (16684 < 16844) and overwrites will be broken.
So you might need to check there isn't a run of non-fully compacted generations longer than 4 (which is what the level 2 & 3 compactions look for.
if i - lastHighLevelIndex > 3 {
break
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found an issue in one of the tests -- makes me wonder if other tests are incorrectly combining tsm files that aren't adjacent. I think each test case could have a check that given the input files/generations and the expected output, all generations in all groups are adjacent -- just to verify the test is accurate too.
We were seeing a state in which a shard would not perform full compactions leading to a build up of level 4 TSM files.
File state:
There is a rouge level 2 file packed within fully compacted files
and level 4 files
The area of our code that would cause this state to be skipped would be here
influxdb/tsdb/engine/tsm1/compact.go
Lines 620 to 670 in 22bec4f
We need to add some sort of escape mechanism that would allow for compactions to occur or simplify this logic.
This PR adds a test for the bugged file listing where we experience the issue. I'm still trying to find the way files get in to this state, but, this should resolve the issue if and when it occurs. I've added the check in PlanOptimize() as to not change any of the core logic in Plan().
Closes #
Describe your proposed changes here.