
Conversation

zhangyue-hashdata (Contributor)

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


zhangyue-hashdata and others added 10 commits August 29, 2025 12:40
Make resource group io limit testing reproducible.

If we retain the objects created during testing, we must clear those
objects before re-running the tests locally, which is not convenient for
developers.
* Add function to clear io.max

This PR has several improvements for io limit:

1. Add a function to clear io.max. This function should be used when
   altering io_limit.

2. Check the tablespaces referenced in io_limit when dropping a
   tablespace. If the tablespace being dropped appears in the io_limit of
   some resource group, the DROP TABLESPACE statement is aborted.

3. In InitResGroup and AlterResourceGroup, if parseio raises an error,
   demote the error to a WARNING so the cluster can still launch when
   some tablespace has been removed.
Fix resource group io limit flaky case.

The flakiness was caused by running mkdir for multiple segments on the
same host at the same time.

Catching FileExistsError and ignoring it is enough; the mkdir call only
needs the directory to exist.
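
A minimal sketch of the tolerant-mkdir pattern described above (the function name is illustrative, not the actual test harness code):

```python
import os

def ensure_dir(path):
    """Create a directory, tolerating concurrent creation by another segment.

    Several segments on the same host may race to create the same path;
    the only requirement is that the directory exists afterwards.
    """
    try:
        os.mkdir(path)
    except FileExistsError:
        # Another segment created it first; nothing more to do.
        pass
```

On Python 3, os.makedirs(path, exist_ok=True) gives the same tolerance in a single call.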
Previously, when io_limit hit a syntax error, the log said only
"Error: Syntax error".

Now io_limit produces a detailed message for syntax errors:
```
demo=# create resource group rg1 WITH (cpu_max_percent=10, 
                                       io_limit='pg_defaultrbps=100, 
                                       wbps=550,riops=1000,wiops=1000');
ERROR:  io limit: syntax error, unexpected '=', expecting ':'
HINT:   pg_defaultrbps=100, wbps=550,riops=1000,wiops=1000
                      ^
```

```
demo=# create resource group rg1 WITH (cpu_max_percent=10,
                                       io_limit='pg_default:
                                       rbps=100wbps=550,riops=1000,
                                       wiops=1000');
ERROR:  io limit: syntax error, unexpected IO_KEY, expecting end of file or ';'
HINT:   pg_default:rbps=100wbps=550,riops=1000,wiops=1000
                           ^
```
io limit: fix double free.

In 'alterResgroupCallback', the io_limit pointers of 'caps' and 'oldCaps'
may point to the same location, so there is a potential double free.

In 'alterResgroupCallback', 'oldCaps' is filled in
'GetResGroupCapabilities' and then assigned to 'caps' via:
caps = oldCaps

To resolve this problem, the code should free oldCaps.io_limit and set it
to NIL when io_limit has not been altered.

So, if io_limit has not been altered, caps.io_limit = oldCaps.io_limit =
NIL. If io_limit has been altered, caps.io_limit != oldCaps.io_limit.
Add one more hierarchy level for resource groups when using cgroup v2.

The current leaf node in the gpdb cgroup hierarchy is
/sys/fs/cgroup/gpdb/<oid>, which is fine for the gpdb workflow. But for
extensions that want to use the gpdb cgroup hierarchy, it is not
convenient.

Extensions like plcontainer want to create a sub-cgroup under
/sys/fs/cgroup/gpdb/<oid> as a new leaf node, which is not possible in
the current hierarchy because of cgroup v2's "no internal processes"
constraint.

This commit uses a new hierarchy to accommodate extensions that want to
use the gpdb cgroup hierarchy, and the modification is tiny: move
processes from /sys/fs/cgroup/gpdb/<oid>/cgroup.procs to
/sys/fs/cgroup/gpdb/<oid>/queries/cgroup.procs, and keep the limits in
/sys/fs/cgroup/gpdb/<oid>.

With this modification, extensions that want to use the gpdb cgroup
hierarchy can create sub-cgroups under /sys/fs/cgroup/gpdb/<oid>.
For example, plcontainer will create a cgroup
/sys/fs/cgroup/gpdb/<oid>/docker-12345 and put processes into it.
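
As a hedged sketch of how an extension could use the new layout, the snippet below creates a sub-cgroup under /sys/fs/cgroup/gpdb/<oid> and migrates a process into it; the function name and the example oid/pid values are illustrative, not plcontainer's actual code:

```python
import os

def attach_to_extension_cgroup(group_oid, sub_name, pid):
    """Create a sub-cgroup under /sys/fs/cgroup/gpdb/<oid>/ and move a process into it.

    Because gpdb's own backends now live under <oid>/queries, the <oid>
    directory itself holds no processes, so cgroup v2's "no internal
    processes" rule allows sibling sub-cgroups like this one.
    """
    sub_path = os.path.join("/sys/fs/cgroup/gpdb", str(group_oid), sub_name)
    os.makedirs(sub_path, exist_ok=True)
    # Writing a PID into cgroup.procs migrates that process into this cgroup.
    with open(os.path.join(sub_path, "cgroup.procs"), "w") as f:
        f.write(str(pid))

# e.g. attach_to_extension_cgroup(16501, "docker-12345", 4242)  # illustrative values
```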
Delete the cgroup leaf dir only when using group-v2.

There is no leaf directory in the gpdb cgroup hierarchy when using cgroup
v1, so rmdir(leaf_path) always returns a non-zero value and the
subsequent rmdir(path) is skipped.
When dropping resource groups, the corresponding cgroup dir then cannot
be removed because rmdir(path) is never executed, and this behavior
causes CI failures.

This commit adds a check of the resource group version in deleteDir:
when using group-v1, rmdir(leaf_path) is skipped.
Add GUC: gp_resource_group_cgroup_parent (only for cgroup v2).

Currently gpdb doesn't support changing the root cgroup path of resource
groups. In some situations it is better if gpdb can change the root
cgroup path of resource groups.

For example, on an OS with systemd, a user may want to delegate a cgroup
to gpdb via systemd, but the delegated cgroup must end with .service,
which is typically /sys/fs/cgroup/gpdb.service. On an OS without systemd,
a user may want to use /sys/fs/cgroup/gpdb or another location directly.
Adding gp_resource_group_cgroup_parent makes resource groups more
flexible.
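
For illustration only, the effective root directory under the two setups above could be derived like this (resource_group_root is a hypothetical helper; the real path handling lives in the C cgroup code):

```python
import os

CGROUP_MOUNT = "/sys/fs/cgroup"

def resource_group_root(cgroup_parent="gpdb"):
    """Return the resource group root for a given gp_resource_group_cgroup_parent.

    "gpdb"         -> /sys/fs/cgroup/gpdb          (no systemd, existing default layout)
    "gpdb.service" -> /sys/fs/cgroup/gpdb.service  (systemd-delegated cgroup)
    """
    return os.path.join(CGROUP_MOUNT, cgroup_parent)
```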
Fix no response when altering a resource group's io_limit to '-1'.

Previously, ALTER RESOURCE GROUP xxx SET IO_LIMIT '-1' had no effect.

Now the action is to clear the content of io.max and update the relation
pg_resgroupcapability.
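
A minimal sketch of what clearing io.max amounts to at the cgroup v2 level; the helper name and the device list are illustrative, and the real implementation lives in the resource group C code:

```python
import os

def clear_io_max(cgroup_dir, devices):
    """Reset io.max for the given block devices in a cgroup v2 directory.

    In cgroup v2, writing "max" for a key removes that limit, so writing
    all four keys as max clears the entry for a device.
    """
    for dev in devices:  # e.g. ["8:0", "259:1"] as major:minor numbers
        with open(os.path.join(cgroup_dir, "io.max"), "w") as f:
            f.write("%s rbps=max wbps=max riops=max wiops=max\n" % dev)
```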
```python
            else:
                self.die("failed to retrieve gp_resource_group_cgroup_parent parameter from database")

        except DatabaseError as e:
```
Contributor

dbconn.connect throws ConnectionError and pgdb.OperationalError. It seems useless to catch DatabaseError here. Just keep except Exception as e:

Contributor Author

fix it in b525d0d

"""
try:
dburl = dbconn.DbURL()
conn = dbconn.connect(dburl, utility=True)
Contributor

use with dbconn.connect(...) as conn to ensure the conn is closed?

Contributor Author

fix in a81054f

…eter

This commit fixes issues introduced in
"Add guc: gp_resource_group_cgroup_parent (#16738)" where the
gp_resource_group_cgroup_parent GUC parameter was added but the
gpcheckresgroupv2impl script still used hardcoded "gpdb" paths.

Changes:
- Implement get_cgroup_parent() method to dynamically retrieve the
  gp_resource_group_cgroup_parent value from the database
- Replace all hardcoded "gpdb" paths with dynamic cgroup parent value
- Improve error handling in cgroup.c with more descriptive error messages
- Fix test configuration order: set gp_resource_group_cgroup_parent before
  enabling gp_resource_manager=group-v2 to avoid validation failures

This ensures the cgroup validation script works correctly with custom
cgroup parent directories configured via the GUC parameter, making the
resource group feature more flexible for different deployment scenarios.
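
For reference, a hedged sketch of the kind of lookup get_cgroup_parent() performs, written as a standalone function with plain DB-API calls on top of gppylib's dbconn; the actual method in gpcheckresgroupv2impl wraps this in the script's own error handling (self.die) and may use different helpers:

```python
from gppylib.db import dbconn

def get_cgroup_parent():
    """Fetch gp_resource_group_cgroup_parent from the local instance in utility mode."""
    dburl = dbconn.DbURL()
    # Use a with-block so the connection is cleaned up, per the review suggestion above.
    with dbconn.connect(dburl, utility=True) as conn:
        cursor = conn.cursor()
        cursor.execute("SHOW gp_resource_group_cgroup_parent")
        row = cursor.fetchone()
    if not row or not row[0]:
        raise RuntimeError("failed to retrieve gp_resource_group_cgroup_parent from database")
    return row[0]
```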