[v4,0/9] cgroup/cpuset: Support remote partitions

Message ID 20230627143508.1576882-1-longman@redhat.com (mailing list archive)

Message

Waiman Long June 27, 2023, 2:34 p.m. UTC
v4:
  - [v3] https://lore.kernel.org/lkml/20230627005529.1564984-1-longman@redhat.com/
  - Fix compilation problem reported by kernel test robot.

v3:
  - [v2] https://lore.kernel.org/lkml/20230531163405.2200292-1-longman@redhat.com/
  - Change the new control file from root-only "cpuset.cpus.reserve" to
    non-root "cpuset.cpus.exclusive" which lists the set of exclusive
    CPUs distributed down the hierarchy.
  - Add a patch to restrict boot-time isolated CPUs to isolated
    partitions only.
  - Update the test_cpuset_prs.sh test script and documentation
    accordingly.

This patch series introduces a new cpuset control file
"cpuset.cpus.exclusive" which must be a subset of "cpuset.cpus"
and the parent's "cpuset.cpus.exclusive". This control file lists
the exclusive CPUs to be distributed down the hierarchy. Any one
of the exclusive CPUs can only be distributed to at most one child
cpuset. Unlike "cpuset.cpus", invalid input to "cpuset.cpus.exclusive"
will be rejected with an error. This new control file has no effect on
the behavior of the cpuset until it turns into a partition root. At that
point, its effective CPUs will be set to its exclusive CPUs unless some
of them are offline.
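
The rules in the paragraph above can be modelled in a few lines. This is
only an illustrative Python sketch, not the kernel implementation: the
function names (check_exclusive_write, partition_effective_cpus) and the
use of plain integer sets for CPU masks are assumptions made for the
example.

```python
def check_exclusive_write(requested, own_cpus, parent_exclusive, sibling_exclusive):
    """Model a write to "cpuset.cpus.exclusive".

    Returns the accepted CPU set, or raises ValueError to stand in for
    the write being rejected with an error.
    """
    requested = set(requested)
    # Must be a subset of this cpuset's own "cpuset.cpus".
    if not requested <= set(own_cpus):
        raise ValueError("not a subset of cpuset.cpus")
    # Must be a subset of the parent's "cpuset.cpus.exclusive".
    if not requested <= set(parent_exclusive):
        raise ValueError("not a subset of the parent's cpuset.cpus.exclusive")
    # An exclusive CPU can be distributed to at most one child.
    for sibling in sibling_exclusive:
        if requested & set(sibling):
            raise ValueError("CPU already distributed to a sibling cpuset")
    return requested


def partition_effective_cpus(exclusive, online):
    """Effective CPUs of a partition root: its exclusive CPUs minus offline ones."""
    return set(exclusive) & set(online)
```

For example, with own CPUs {0,1,2,3}, a parent exclusive set of {1,2,3}
and a sibling already holding CPU 1, a request for {2,3} is accepted
while a request for {1,2} is rejected.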

This patch series also introduces a new category of cpuset partition
called remote partitions. The existing partition category where the
partition roots have to be clustered around the root cgroup in a
hierarchical way is now referred to as local partitions.

A remote partition can be formed far from the root cgroup with no
partition root parent. Local partitions can be created without
touching "cpuset.cpus.exclusive", as it is set automatically when a
cpuset becomes a local partition root. Creating a remote partition,
however, requires properly set "cpuset.cpus.exclusive" values down the
hierarchy.

Both scheduling and isolated partitions can be formed in a remote
partition. A local partition can be created under a remote partition.
A remote partition, however, cannot be formed under a local partition
for now.

Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers, and they rely on other
middleware like systemd to help manage it. If a container needs to
use isolated CPUs, it is hard to get those with local partitions,
as that requires the administrative parent cgroup to be a partition
root too, which tools like systemd may not be ready to manage.

With this patch series, we allow the creation of remote partitions
far from the root. The container management tool can manage the
"cpuset.cpus.exclusive" file without impacting the other cpuset
files that are managed by other middleware. Of course, invalid
"cpuset.cpus.exclusive" values will be rejected, and changes to
"cpuset.cpus" can affect the value of "cpuset.cpus.exclusive" due to
the requirement that it has to be a subset of the former control file.

Waiman Long (9):
  cgroup/cpuset: Inherit parent's load balance state in v2
  cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
    handling
  cgroup/cpuset: Improve temporary cpumasks handling
  cgroup/cpuset: Allow suppression of sched domain rebuild in
    update_cpumasks_hier()
  cgroup/cpuset: Add cpuset.cpus.exclusive for v2
  cgroup/cpuset: Introduce remote partition
  cgroup/cpuset: Check partition conflict with housekeeping setup
  cgroup/cpuset: Documentation update for partition
  cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition

 Documentation/admin-guide/cgroup-v2.rst       |  100 +-
 kernel/cgroup/cpuset.c                        | 1347 ++++++++++++-----
 .../selftests/cgroup/test_cpuset_prs.sh       |  398 +++--
 3 files changed, 1291 insertions(+), 554 deletions(-)

Comments

Tejun Heo July 10, 2023, 9:08 p.m. UTC | #1
Hello, Waiman.

I applied the prep patches. They look good on their own.

On Tue, Jun 27, 2023 at 10:34:59AM -0400, Waiman Long wrote:
...
> cpuset. Unlike "cpuset.cpus", invalid input to "cpuset.cpus.exclusive"
> will be rejected with an error. This new control file has no effect on

We cannot maintain this as an invariant tho, right? For example, what
happens when a parent cgroup later wants to withdraw a CPU from its
cpuset.cpus which should always be allowed regardless of what its
descendants are doing? Even with cpus.exclusive itself, I think it'd be
important to always allow ancestors to be able to withdraw from the
commitment as with other resources. I suppose one can argue that giving
exclusive access to CPUs is a special case which doesn't follow this rule
but cpus.exclusive having to be nested inside cpus which is subject to that
rule makes that combination too contorted.

Would it be difficult to follow how isolation modes behave when the target
configuration can't be achieved?

Thanks.
Waiman Long July 11, 2023, 12:33 a.m. UTC | #2
On 7/10/23 17:08, Tejun Heo wrote:
> Hello, Waiman.
>
> I applied the prep patches. They look good on their own.
>
> On Tue, Jun 27, 2023 at 10:34:59AM -0400, Waiman Long wrote:
> ...
>> cpuset. Unlike "cpuset.cpus", invalid input to "cpuset.cpus.exclusive"
>> will be rejected with an error. This new control file has no effect on
> We cannot maintain this as an invariant tho, right? For example, what
> happens when a parent cgroup later wants to withdraw a CPU from its
> cpuset.cpus which should always be allowed regardless of what its
> descendants are doing? Even with cpus.exclusive itself, I think it'd be
> important to always allow ancestors to be able to withdraw from the
> commitment as with other resources. I suppose one can argue that giving
> exclusive access to CPUs is a special case which doesn't follow this rule
> but cpus.exclusive having to be nested inside cpus which is subject to that
> rule makes that combination too contorted.
>
> Would it be difficult to follow how isolation modes behave when the target
> configuration can't be achieved?

I would like to clarify that withdrawal of CPUs from
cpuset.cpus.exclusive is always allowed. It is the addition of CPUs not
present in cpuset.cpus that will be rejected. The invariant is that
cpuset.cpus.exclusive must always be a subset of cpuset.cpus. Any change
that violates this rule is not allowed. Alternatively, I can silently
drop the offending CPUs without returning an error, but that may
surprise users.

BTW, withdrawal of CPUs from cpuset.cpus will also withdraw them from 
cpuset.cpus.exclusive, if present. This allows the partition code to use 
cpuset.cpus.exclusive directly to determine the allowable exclusive CPUs 
without doing an intersection with cpuset.cpus each time it is used.
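
As a rough illustration of the behaviour described above, here is a toy
Python model. It is not the kernel code: the Cpuset class and the
write_cpus() helper are hypothetical names, CPU masks are plain sets of
CPU numbers, and the propagation to descendants is assumed from the
subset-of-parent rule in the cover letter.

```python
class Cpuset:
    """Toy model of a cpuset holding "cpus" and "cpus.exclusive" masks."""

    def __init__(self, cpus, exclusive=(), parent=None):
        self.cpus = set(cpus)
        self.exclusive = set(exclusive)
        self.children = []
        if parent is not None:
            parent.children.append(self)


def _prune_descendants(cs):
    # A child's exclusive set must stay within its parent's exclusive set.
    for child in cs.children:
        child.exclusive &= cs.exclusive
        _prune_descendants(child)


def write_cpus(cs, new_cpus):
    """Model a write to cpuset.cpus: CPUs that are withdrawn are also
    dropped from cpus.exclusive, here and in all descendants."""
    cs.cpus = set(new_cpus)
    cs.exclusive &= cs.cpus   # keep exclusive a subset of cpus
    _prune_descendants(cs)    # descendants' own cpus are left untouched
```

After write_cpus() returns, the exclusive set of every cpuset can be
used as is, with no further intersection against cpus.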

Please let me know if you want a different behavior.

Cheers,
Longman
Tejun Heo July 11, 2023, 1 a.m. UTC | #3
Hello,

On Mon, Jul 10, 2023 at 08:33:11PM -0400, Waiman Long wrote:
> I would like to clarify that withdrawal of CPUs from cpuset.cpus.exclusive
> is always allowed. It is the addition of CPUs not present in cpuset.cpus
> that will be rejected. The invariant is that cpuset.cpus.exclusive must
> always be a subset of cpuset.cpus. Any change that violates this rule is not
> allowed. Alternatively, I can silently drop the offending CPUs without
> returning an error, but that may surprise users.

Right, that'd be confusing.

> BTW, withdrawal of CPUs from cpuset.cpus will also withdraw them from
> cpuset.cpus.exclusive, if present. This allows the partition code to use
> cpuset.cpus.exclusive directly to determine the allowable exclusive CPUs
> without doing an intersection with cpuset.cpus each time it is used.

This is kinda confusing too, I think. Changing cpuset.cpus in an ancestor
doesn't affect the contents of the descendants' cpuset.cpus files but would
directly modify the contents of their cpuset.cpus.exclusive files.

There's some inherent friction because cpuset.cpus separates configuration
(cpuset.cpus) and the current state (cpuset.cpus.effective) while
cpuset.cpus.exclusive is trying to do both in the same interface file. When
the two behavior modes collide, it becomes rather confusing. Do you think
it'd make sense to make cpus.exclusive follow the same pattern as
cpuset.cpus?

Thanks.
Waiman Long July 11, 2023, 1:38 a.m. UTC | #4
On 7/10/23 21:00, Tejun Heo wrote:
> Hello,
>
> On Mon, Jul 10, 2023 at 08:33:11PM -0400, Waiman Long wrote:
>> I would like to clarify that withdrawal of CPUs from cpuset.cpus.exclusive
>> is always allowed. It is the addition of CPUs not present in cpuset.cpus
>> that will be rejected. The invariant is that cpuset.cpus.exclusive must
>> always be a subset of cpuset.cpus. Any change that violates this rule is not
>> allowed. Alternatively, I can silently drop the offending CPUs without
>> returning an error, but that may surprise users.
> Right, that'd be confusing.
>
>> BTW, withdrawal of CPUs from cpuset.cpus will also withdraw them from
>> cpuset.cpus.exclusive, if present. This allows the partition code to use
>> cpuset.cpus.exclusive directly to determine the allowable exclusive CPUs
>> without doing an intersection with cpuset.cpus each time it is used.
> This is kinda confusing too, I think. Changing cpuset.cpus in an ancestor
> doesn't affect the contents of the descendants' cpuset.cpus files but would
> directly modify the contents of their cpuset.cpus.exclusive files.
>
> There's some inherent friction because cpuset.cpus separates configuration
> (cpuset.cpus) and the current state (cpuset.cpus.effective) while
> cpuset.cpus.exclusive is trying to do both in the same interface file. When
> the two behavior modes collide, it becomes rather confusing. Do you think
> it'd make sense to make cpus.exclusive follow the same pattern as
> cpuset.cpus?

I don't want to add another cpuset.cpus.exclusive.effective control
file. One possibility is to keep another effective mask in the struct
cpuset and list both the exclusive CPUs set by the user and the effective
ones side by side, like "<cpus> (<effective_cpus>)" if they differ, or
some other format. What do you think?

Regards,
Longman
Tejun Heo July 11, 2023, 1:45 a.m. UTC | #5
Hello,

On Mon, Jul 10, 2023 at 09:38:12PM -0400, Waiman Long wrote:
> I don't want to add another cpuset.cpus.exclusive.effective control file.
> One possibility is to keep another effective mask in the struct cpuset and
> list both exclusive cpus set by the user and the effective ones side by
> side, like "<cpus> (<effective_cpus>)" if they differ or some other format.
> What do you think?

Hmm... if we go for separate effective mask, I think it'd be better to stay
consistent with cpuset.cpus[.effective]. That's the convention both
cpuset.cpus and cpuset.mems already follow. I'm not sure what we'd gain by
deviating.
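
The configuration/effective split referred to here can be sketched as
follows. This is a hypothetical Python model, not kernel code: the Node
class and its attribute names are invented for illustration, and it
ignores details such as empty configurations and CPU hotplug. It shows
the general pattern only; nothing in this thread settles how
cpuset.cpus.exclusive would implement it.

```python
class Node:
    """Toy cgroup node with a configured mask and a derived effective mask."""

    def __init__(self, parent=None):
        self.parent = parent
        # What the user wrote; changes in ancestors never rewrite it.
        self.configured = set()

    def effective(self):
        """Configured value limited by what the ancestors currently grant."""
        if self.parent is None:
            return set(self.configured)
        return self.configured & self.parent.effective()
```

With this split, an ancestor can withdraw a CPU at any time: the
descendants' configured values stay as written and only their effective
values shrink.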

Thanks.