[v7,5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst

Message ID 20210825213750.6933-6-longman@redhat.com (mailing list archive)
State New
Series cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus

Commit Message

Waiman Long Aug. 25, 2021, 9:37 p.m. UTC
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as the ability to create
non-top cpuset partition with no cpu allocated to it.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 112 +++++++++++++-----------
 1 file changed, 63 insertions(+), 49 deletions(-)

Comments

Tejun Heo Aug. 26, 2021, 5:35 p.m. UTC | #1
Hello, Waiman.

Let's stop iterating on the patchset until we reach a consensus.

On Wed, Aug 25, 2021 at 05:37:49PM -0400, Waiman Long wrote:
>  	1) The "cpuset.cpus" is not empty and the list of CPUs are
>  	   exclusive, i.e. they are not shared by any of its siblings.

Part of it can be reached by cpus going offline.

>  	2) The parent cgroup is a partition root.

This condition can happen if a parent stops being a partition.

> -	3) The "cpuset.cpus" is also a proper subset of the parent's
> +	3) The "cpuset.cpus" is a subset of the parent's
>  	   "cpuset.cpus.effective".

This can happen if cpus go offline.

>  	4) There is no child cgroups with cpuset enabled.  This is for
>  	   eliminating corner cases that have to be handled if such a
>  	   condition is allowed.

This may make sense as a short cut for us but doesn't really stem from
interface or behavior requirements.

Of the four conditions listed, two are bogus (the states can be
reached through a different path and the configuration success or
failure can be timing dependent if configuration races against cpu
hotplug operations) and one maybe makes sense half-way and one is more
of a shortcut.

Can't we just replace these with transitions to invalid state with
proper explanation? That'd get rid of the error handling duplications
from both the kernel and user side, make automated configurations
which may race against hot plug operations reliable, and consistently
provide users with why something failed.

Thank you.
Waiman Long Aug. 27, 2021, 3:01 a.m. UTC | #2
On 8/26/21 1:35 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> Let's stop iterating on the patchset until we reach a consensus.
>
> On Wed, Aug 25, 2021 at 05:37:49PM -0400, Waiman Long wrote:
>>   	1) The "cpuset.cpus" is not empty and the list of CPUs are
>>   	   exclusive, i.e. they are not shared by any of its siblings.
> Part of it can be reached by cpus going offline.
>
>>   	2) The parent cgroup is a partition root.
> This condition can happen if a parent stops being a partition.
>
>> -	3) The "cpuset.cpus" is also a proper subset of the parent's
>> +	3) The "cpuset.cpus" is a subset of the parent's
>>   	   "cpuset.cpus.effective".
> This can happen if cpus go offline.
>
>>   	4) There is no child cgroups with cpuset enabled.  This is for
>>   	   eliminating corner cases that have to be handled if such a
>>   	   condition is allowed.
> This may make sense as a short cut for us but doesn't really stem from
> interface or behavior requirements.
>
> Of the four conditions listed, two are bogus (the states can be
> reached through a different path and the configuration success or
> failure can be timing dependent if configuration races against cpu
> hotplug operations) and one maybe makes sense half-way and one is more
> of a shortcut.
>
> Can't we just replace these with transitions to invalid state with
> proper explanation? That'd get rid of the error handling duplications
> from both the kernel and user side, make automated configurations
> which may race against hot plug operations reliable, and consistently
> provide users with why something failed.

What I am doing here is setting a high bar for transitioning from member 
to either "root" or "isolated". Once it becomes a partition, there are 
multiple ways that can make it invalid. I am fine with that. However, I 
am not sure it is a good idea to allow users to echo "root" to 
cpuset.cpus.partition anywhere in the cgroup hierarchy and require them 
to read it back to see if it succeeded.

All the checks are done with cpuset_rwsem held. So there shouldn't be 
any racing. Of course, a hotplug can immediately follow and make the 
partition invalid.

Cheers,
Longman
Tejun Heo Aug. 27, 2021, 4 a.m. UTC | #3
Hello,

On Thu, Aug 26, 2021 at 11:01:30PM -0400, Waiman Long wrote:
> What I am doing here is setting a high bar for transitioning from member to
> either "root" or "isolated". Once it becomes a partition, there are multiple
> ways that can make it invalid. I am fine with that. However, I am not sure
> it is a good idea to allow users to echo "root" to cpuset.cpus.partition
> anywhere in the cgroup hierarchy and require them to read it back to see if
> it succeeded.

The problem is that the "high" bar is rather arbitrary. It might feel like a
good idea to some but not to others. There are no clear technical reasons or
principles for rules to be set this particular way.

> All the checks are done with cpuset_rwsem held. So there shouldn't be any
> racing. Of course, a hotplug can immediately follow and make the partition
> invalid.

Imagine a system which dynamically on/offlines its cpus based on load or
whatever and also configures partitions for cases where the needed cpus are
online. If the partitions are set up while the cpus are online, it'd work as
expected - partitions are in effect when the system can support them and
ignored otherwise. However, if the partition configuration is attempted
while the cpus happen to be offline, the configuration will fail, and there
is no guaranteed way to make that configuration stick short of disabling
hotplug operations. This is a pretty jarring breakage happening exactly
because the behavior is an inconsistent amalgam.
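
The timing dependence described above can be illustrated with a toy model. This is only a sketch, not kernel code: `configure_strict`, `configure_lenient` and `is_valid` are hypothetical names, and the "strict" rule is simplified to "every requested CPU must be online at write time".

```python
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    cpus: set = field(default_factory=set)   # configured CPUs
    partition: str = "member"                # configured partition type


def configure_strict(cs, cpus, online):
    """Hypothetical strict rule: refuse unless every requested CPU is online."""
    if not cpus or not cpus <= online:
        return False          # the write fails and nothing is remembered
    cs.cpus, cs.partition = set(cpus), "root"
    return True


def configure_lenient(cs, cpus):
    """Hypothetical lenient rule: always record the request."""
    cs.cpus, cs.partition = set(cpus), "root"
    return True


def is_valid(cs, online):
    """Validity is derived from the stored config and the current online set."""
    return cs.partition == "root" and bool(cs.cpus & online)


wanted = {2, 3}
online_at_config = {0, 1}     # CPUs 2-3 happen to be offline at config time
online_later = {0, 1, 2, 3}   # ...and come back afterwards

strict, lenient = Cpuset(), Cpuset()
strict_ok = configure_strict(strict, wanted, online_at_config)
lenient_ok = configure_lenient(lenient, wanted)

print(strict_ok, is_valid(strict, online_later))        # False False
print(lenient_ok, is_valid(lenient, online_at_config))  # True False
print(is_valid(lenient, online_later))                  # True
```

In this model the strict variant never recovers once the write has been rejected, while the lenient variant becomes valid as soon as the CPUs are back online.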

It's usually not a good sign if interface restrictions can be added or
removed because of how one feels without clear functional reasons and often
indicates that there's something broken, which seems to be the case here
too.

Thanks.
Waiman Long Aug. 27, 2021, 9:19 p.m. UTC | #4
On 8/27/21 12:00 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 26, 2021 at 11:01:30PM -0400, Waiman Long wrote:
>> What I am doing here is setting a high bar for transitioning from member to
>> either "root" or "isolated". Once it becomes a partition, there are multiple
>> ways that can make it invalid. I am fine with that. However, I am not sure
>> it is a good idea to allow users to echo "root" to cpuset.cpus.partition
>> anywhere in the cgroup hierarchy and require them to read it back to see if
>> it succeeded.
> The problem is that the "high" bar is rather arbitrary. It might feel like a
> good idea to some but not to others. There are no clear technical reasons or
> principles for rules to be set this particular way.
>
>> All the checks are done with cpuset_rwsem held. So there shouldn't be any
>> racing. Of course, a hotplug can immediately follow and make the partition
>> invalid.
> Imagine a system which dynamically on/offlines its cpus based on load or
> whatever and also configures partitions for cases where the needed cpus are
> online. If the partitions are set up while the cpus are online, it'd work as
> expected - partitions are in effect when the system can support them and
> ignored otherwise. However, if the partition configuration is attempted
> while the cpus happen to be offline, the configuration will fail, and there
> is no guaranteed way to make that configuration stick short of disabling
> hotplug operations. This is a pretty jarring breakage happening exactly
> because the behavior is an inconsistent amalgam.
>
> It's usually not a good sign if interface restrictions can be added or
> removed because of how one feels without clear functional reasons and often
> indicates that there's something broken, which seems to be the case here
> too.

Well, that is a valid point. The cpus may have been offlined when a 
partition is being created. I can certainly relent on this check in 
forming a partition. IOW, cpus_allowed can contain some or all offline 
cpus and a valid (some are online) or invalid (all are offline) 
partition can be formed. I can also allow an invalid child partition to 
be formed with an invalid parent partition. However, the cpu exclusivity 
rules will still apply.

Other than that, do you envision any other circumstances where we should 
allow an invalid partition to be formed?

Cheers,
Longman
Tejun Heo Aug. 27, 2021, 9:27 p.m. UTC | #5
Hello,

On Fri, Aug 27, 2021 at 05:19:31PM -0400, Waiman Long wrote:
> Well, that is a valid point. The cpus may have been offlined when a
> partition is being created. I can certainly relent on this check in forming
> a partition. IOW, cpus_allowed can contain some or all offline cpus and a
> valid (some are online) or invalid (all are offline) partition can be
> formed. I can also allow an invalid child partition to be formed with an
> invalid parent partition. However, the cpu exclusivity rules will still
> apply.
> 
> Other than that, do you envision any other circumstances where we should
> allow an invalid partition to be formed?

Now that most restrictions are removed from configuration side, just go all
the way? Given that the user must check the status afterwards anyway, I
don't see technical or even usability reasons for leaving some pieces
behind. Going all the way would be easier to use too - bang in the target
config and read the resulting state to reliably find out why a partition
isn't valid, especially if we list *all* the reasons so that the user can
tell whether the configuration is as intended immediately.
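
From user space, the "write the target config, then read the resulting state" flow could look like the sketch below. It assumes the "root invalid (<reason>)" read format from the draft documentation later in this thread; `set_partition` and `describe` are hypothetical helper names, and the cgroup path in the usage note is made up.

```python
from pathlib import Path


def set_partition(cgroup_dir, value):
    """Write the requested type, then return the state the file reports."""
    ctl = Path(cgroup_dir) / "cpuset.cpus.partition"
    ctl.write_text(value + "\n")
    return ctl.read_text().strip()


def describe(state):
    """Turn the reported state into a short message for an operator."""
    if state.startswith("root invalid"):
        return "partition requested but not in effect: " + state
    return "partition state: " + state
```

A caller would use it as `describe(set_partition("/sys/fs/cgroup/mygroup", "root"))` and act on the reported reason rather than on a write error.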

Thanks.
Waiman Long Aug. 27, 2021, 10:50 p.m. UTC | #6
On 8/27/21 5:27 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Aug 27, 2021 at 05:19:31PM -0400, Waiman Long wrote:
>> Well, that is a valid point. The cpus may have been offlined when a
>> partition is being created. I can certainly relent on this check in forming
>> a partition. IOW, cpus_allowed can contain some or all offline cpus and a
>> valid (some are online) or invalid (all are offline) partition can be
>> formed. I can also allow an invalid child partition to be formed with an
>> invalid parent partition. However, the cpu exclusivity rules will still
>> apply.
>>
>> Other than that, do you envision any other circumstances where we should
>> allow an invalid partition to be formed?
> Now that most restrictions are removed from configuration side, just go all
> the way? Given that the user must check the status afterwards anyway, I
> don't see technical or even usability reasons for leaving some pieces
> behind. Going all the way would be easier to use too - bang in the target
> config and read the resulting state to reliably find out why a partition
> isn't valid, especially if we list *all* the reasons so that the user can
> tell whether the configuration is as intended immediately.

The cpu exclusivity rule is due to the setting of CPU_EXCLUSIVE bit. 
This is a pre-existing condition unless you want to change how the 
cpuset.cpu_exclusive works.

So the new rules will be:

1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.
2) The parent cgroup is a partition root (can be an invalid one).
3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.
4) No child cgroup with cpuset enabled.

I think they are reasonable. What do you think?

Cheers,
Longman
Tejun Heo Aug. 27, 2021, 11:35 p.m. UTC | #7
Hello,

On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long wrote:
> The cpu exclusivity rule is due to the setting of CPU_EXCLUSIVE bit. This is
> a pre-existing condition unless you want to change how the
> cpuset.cpu_exclusive works.
>
> So the new rules will be:
> 
> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.

Empty cpu list can be considered an exclusive one.

> 2) The parent cgroup is a partition root (can be an invalid one).

Does this mean a partition parent can't stop being a partition if one or
more of its children become partitions? If so, it violates the rule that a
descendant shouldn't be able to restrict what its ancestors can do.

> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.

Why not just go by effective? This would mean that a parent can't withdraw
CPUs from its allowed set once descendants are configured. Restrictions like
this are fine when the entire hierarchy is configured by a single entity but
become awkward when configurations are multi-tiered, automated and dynamic.

> 4) No child cgroup with cpuset enabled.

idk, maybe? I'm having a hard time seeing the point in adding these
restrictions when the state transitions are asynchronous anyway. Would it
help if we try to separate what's absolutely and technically necessary and
what seems reasonable or high bar and try to justify why each of the latter
should be added?

Thanks.
Waiman Long Aug. 28, 2021, 1:14 a.m. UTC | #8
On 8/27/21 7:35 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long wrote:
>> The cpu exclusivity rule is due to the setting of CPU_EXCLUSIVE bit. This is
>> a pre-existing condition unless you want to change how the
>> cpuset.cpu_exclusive works.
>>
>> So the new rules will be:
>>
>> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.
> Empty cpu list can be considered an exclusive one.
It doesn't make sense to me to have a partition with no cpu configured 
at all. I very much prefer the users to set cpuset.cpus first before 
turning it into a partition.
>
>> 2) The parent cgroup is a partition root (can be an invalid one).
> Does this mean a partition parent can't stop being a partition if one or
> more of its children become partitions? If so, it violates the rule that a
> descendant shouldn't be able to restrict what its ancestors can do.

No. As I said in the documentation, transitioning from partition root to 
member is allowed. Again, it is illogical to allow a cpuset to become 
a potential partition if its parent is not even a partition root at all. 
In the case that the parent is reverted back to a member, the child 
partitions will stay invalid forever unless the parent becomes a valid 
partition again.

>
>> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.
> Why not just go by effective? This would mean that a parent can't withdraw
> CPUs from its allowed set once descendants are configured. Restrictions like
> this are fine when the entire hierarchy is configured by a single entity but
> become awkward when configurations are multi-tiered, automated and dynamic.

The original rule is to be based on effective cpus. However, to properly 
handle the case of allowing offlined cpus to be included in the 
partition, I have to change it to cpus_allowed instead. I can certainly 
change it back to effective if you prefer.

>
>> 4) No child cgroup with cpuset enabled.
> idk, maybe? I'm having a hard time seeing the point in adding these
> restrictions when the state transitions are asynchronous anyway. Would it
> help if we try to separate what's absolutely and technically necessary and
> what seems reasonable or high bar and try to justify why each of the latter
> should be added?

This rule is there mainly for ease of implementation. Otherwise, I need 
to add additional code to handle the conversion of child cpusets which 
can be rather complex and require a lot more debugging. This rule will 
no longer apply once the cpuset becomes a partition root.

Cheers,
Longman
Michal Koutný Aug. 30, 2021, 5:59 p.m. UTC | #9
Hello.

On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long <llong@redhat.com> wrote:
> So the new rules will be:

When I followed the thread, it seemed to me you're talking past each
other a bit. I'd suggest the following terminology:

- config space: what's written by the user and saved,

- reality space: what's currently available (primarily subject to
  on-/offlining but I think it'd be helpful to consider here also what's
  given by the parent),

- effect space: what's actually possible and happening.

Not all elements of config_space x reality_space (Cartesian product) can
be represented in the effect_space (e.g. root partition with no
(effective) cpus).
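
One way to picture the three spaces is a function from (config, reality) to effect. The sketch below is a deliberately small model with a hypothetical `effect` helper; it only covers CPU availability, not exclusivity or the parent's state.

```python
def effect(config_cpus, reality_cpus, wants_partition):
    """Combine config space and reality space into the effect space.

    Returns (effective_cpus, state). In this toy model a partition with
    no usable CPU cannot be represented as a working partition, so it
    is reported as invalid instead.
    """
    effective = config_cpus & reality_cpus
    if not wants_partition:
        return effective, "member"
    if not effective:
        return effective, "root invalid"
    return effective, "root"


config = {0, 1}                              # what the user wrote and what is saved
print(effect(config, {0, 1, 2, 3}, True))    # ({0, 1}, 'root')
print(effect(config, {2, 3}, True))          # (set(), 'root invalid')
```

The config stays the same in both calls; only the reality space changes, and with it the effect.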

IIUC, Waiman's "high bar" is supposed to be defined over transitions in
the config_space. However, there can be independent changes in the
reality_space so the rules should be actually formulated in the
effect_space:

The conditions for being a valid partition root rewritten into the effect
space:

> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.
- effective CPUs are non-empty and exclusive wrt siblings
- (E.g. setting empty cpuset.cpus might be possible but it invalidates
  the partition root, same as offlining or removal by an ancestor.)

> 2) The parent cgroup is a partition root (can be an invalid one).
- parent cgroup is a (valid) partition
- (Being valid partition means owning "stolen" cpus from the parent, if
  the parent is not valid partition itself, you can't steal what is not
  owned.)
- (And I think it's OK that: "the child partitions will stay invalid
  forever unless the parent become a valid partition again" [1].)

> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.
- I'm not sure what is the use of this condition (together with the
  rewrite of the 1st condition which covers effective cpus). I think it
  would make sense if being a valid partition root guaranteed that all
  configured cpuset.cpus will be available, however, that's not the case
  IIUC (e.g. due to offlining).

> 4) No child cgroup with cpuset enabled.
- A child cgroup with cpuset enabled is OK in the effect space
  (achievable by switching first and creating children later).
- For technical reasons this may be a condition on the transitions in
  the config_space.

Generally, most config changes should succeed and user should check (or
watch) how they landed in combination with the reality_space.

Regards,
Michal

[1] This follows the general model where ancestors can "preempt"
resources from their subtree.
Michal Koutný Oct. 12, 2021, 2:39 p.m. UTC | #10
On Wed, Oct 06, 2021 at 02:21:03PM -0400, Waiman Long <llong@redhat.com> wrote:
> Sorry for not following up with this patchset sooner as I was busy on other
> tasks.

Thanks for continuing with this.

> 	1) The "cpuset.cpus" is not empty and the list of CPUs are
> 	   exclusive, i.e. they are not shared by any of its siblings.
> 	2) The parent cgroup is a partition root.
> 	3) The "cpuset.cpus" is a subset of the union of parent's
> 	   "cpuset.cpus.effective" and offlined CPUs in parent's
> 	   "cpuset.cpus".
> 	4) There is no child cgroups with cpuset enabled.  This avoids
> 	   cpu migrations of multiple cgroups simultaneously which can
> 	   be problematic.
> 
>         A partition, when enabled, can be in an invalid state. An example
>         is when its parent is also an invalid partition.

You say:
"it can only be enabled in a cgroup if all the following conditions are met.",
"2) The parent cgroup is a partition root."

and then the example:
"A partition, when enabled, can be in an invalid state. An example is
when its parent is also an invalid partition."

But the first two statements imply you can't have enabled the partition
in such a case.

I think there is still a mixup of partition validity conditions and
transition conditions, yours would roughly divide into (not precisely,
just to share my understanding):

Validity conditions
 	1) The "cpuset.cpus" is not empty and the list of CPUs are
 	   exclusive, i.e. they are not shared by any of its siblings.
 	2) The parent cgroup is a partition root.

Transition conditions:
 	3) The "cpuset.cpus" is a subset of the union of parent's
 	   "cpuset.cpus.effective" and offlined CPUs in parent's
 	   "cpuset.cpus".
 	4) There is no child cgroups with cpuset enabled.  This avoids
 	   cpu migrations of multiple cgroups simultaneously which can
 	   be problematic.

(I've put no. 3 into transition conditions because _after_ the
transition the new root's cpuset.cpus are subtracted from the parent's
cpuset.cpus.effective. I'd like to have something similar as a validity
condition, but I haven't come up with that yet.)

I consider the following situation:

r		// all cpus 0-7
`- part1	cpus=0-3	root >partition
   ` subpart1	cpus=0-1	root >partition
   ` subpart2	cpus=2-3	root >partition
`- other	cpus=4-7	// member by default

Both subpart1 and subpart2 are valid partition roots.
Look at actions listed below (as alternatives, not a sequence):

a) hotplug offlines cpu 3
  - would part1 still be considered a valid root? 
    - perhaps not
  - would subpart1 still be considered a valid root? 
    - it could be, but its parent is invalid so no?
  - would subpart2 still be considered a valid root? 
    - perhaps not
    
b) administrative change writes 0-2 into part1 cpus
  - would part1 still be considered a valid root? 
    - yes
  - would subpart1 still be considered a valid root? 
    - yes
  - would subpart2 still be considered a valid root? 
    - perhaps not

c) administrative change writes 3-7 into `other` cpus
  - should this fail or invalidate a root partition part1?
    - perhaps fail since the same "owner" manages all siblings and
      should reduce part1 first

The answers above are just my "natural" responses, the ideal may be
different. The issue I want to illustrate is that if all the conditions
are formed as transition conditions only, they can't be used to reason
about hotplug or config changes (except for cpuset.cpus.partition
writes).

What would help me with the understanding -- the invalid root partition is defined as
1) such a cgroup where no cpus are granted from the top (and thus has to fall back to ancestors)
or
2) such a cgroup where cpus requested in cpuset.cpus can't be fulfilled (i.e. any missing invalidates)?

Furthermore, another example (motivated by the patch 4/6)

r		// all cpus 0-7
`- part1	cpus=0-4	root >partition
   ` subpart1	cpus=0-1	root >partition
   ` subpart2	cpus=2-3	root >partition
   ` task
`- other	cpus=5-7	// member by default

It's a valid and achievable state (even on v2 since cpuset is a threaded
controller). 

a) cpu 4 is offlined
  - this should invalidate part1 (and propagate invalidation into
    subpart1 and subpart2).
b) administrative write 0-3 into part1 cpus
  - should this invalidate part1 or be rejected?


In conclusion, it'd be good to have validity conditions separate from
transition conditions (since hotplug transition can't be rejected) and
perhaps treat administrative changes from an ancestor equally as a
hotplug.

Thanks,
Michal
Waiman Long Oct. 13, 2021, 9:45 p.m. UTC | #11
On 10/12/21 10:39 AM, Michal Koutný wrote:
> On Wed, Oct 06, 2021 at 02:21:03PM -0400, Waiman Long <llong@redhat.com> wrote:
>> Sorry for not following up with this patchset sooner as I was busy on other
>> tasks.
> Thanks for continuing with this.
>
>> 	1) The "cpuset.cpus" is not empty and the list of CPUs are
>> 	   exclusive, i.e. they are not shared by any of its siblings.
>> 	2) The parent cgroup is a partition root.
>> 	3) The "cpuset.cpus" is a subset of the union of parent's
>> 	   "cpuset.cpus.effective" and offlined CPUs in parent's
>> 	   "cpuset.cpus".
>> 	4) There is no child cgroups with cpuset enabled.  This avoids
>> 	   cpu migrations of multiple cgroups simultaneously which can
>> 	   be problematic.
>>
>>          A partition, when enabled, can be in an invalid state. An example
>>          is when its parent is also an invalid partition.

Thanks for the comments.

> You say:
> "it can only be enabled in a cgroup if all the following conditions are met.",
> "2) The parent cgroup is a partition root."
>
> and then the example:
> "A partition, when enabled, can be in an invalid state. An example is
> when its parent is also an invalid partition."
>
> But the first two statements imply you can't have enabled the partition
> in such a case.

Yes, you are right. We should not allow enabling a partition when the 
parent is an invalid partition root. I will fix that.


> I think there is still mixup of partition validity conditions and
> transition conditions, yours would roughly divide into (not precisely,
> just to share my understanding):
>
> Validity conditions
>   	1) The "cpuset.cpus" is not empty and the list of CPUs are
>   	   exclusive, i.e. they are not shared by any of its siblings.
>   	2) The parent cgroup is a partition root.
>
> Transition conditions:
>   	3) The "cpuset.cpus" is a subset of the union of parent's
>   	   "cpuset.cpus.effective" and offlined CPUs in parent's
>   	   "cpuset.cpus".

I am going to change this condition to just "cpuset.cpus" is a subset of 
parent's "cpuset.cpus". After some deliberation, I have concluded that it 
doesn't make sense from the system partition planning point of view to 
allow a valid partition to contain cpus that are not in the designated 
"cpuset.cpus". That will automatically include offlined cpus in 
parent's "cpuset.cpus".


>   	4) There is no child cgroups with cpuset enabled.  This avoids
>   	   cpu migrations of multiple cgroups simultaneously which can
>   	   be problematic.
>
> (I've put no. 3 into transition conditions because _after_ the
> transition parent's cpuset.cpus.effective are subtracted the new root's
> cpuset.cpus but I'd like to have something similar as a validity
> condition but I haven't come up with that yet.)
>
> I consider the following situation:
>
> r		// all cpus 0-7
> `- part1	cpus=0-3	root >partition
>     ` subpart1	cpus=0-1	root >partition
>     ` subpart2	cpus=2-3	root >partition
> `- other	cpus=4-7	// member by default
>
> Both subpart1 and subpart2 are valid partition roots.
> Look at actions listed below (as alternatives, not a sequence):
>
> a) hotplug offlines cpu 3
>    - would part1 still be considered a valid root?
>      - perhaps not
>    - would subpart1 still be considered a valid root?
>      - it could be, but its parent is invalid so no?
>    - would subpart2 still be considered a valid root?
>      - perhaps not
>      

They will all be valid roots. They will become invalid only when their 
effective cpus are empty and there are tasks in the partition.
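
Applied to scenario (a), the rule stated here can be sketched as follows. The model is an assumption made for illustration: a partition's effective CPUs are taken to be its configured CPUs that are online and not handed to child partitions, part1 is assumed to hold no tasks of its own, and `effective` and `is_valid` are hypothetical names.

```python
def effective(cpus, online, child_cpus=()):
    """CPUs left to a partition: configured, online, not handed to child partitions."""
    taken = set().union(*child_cpus)
    return (cpus & online) - taken


def is_valid(eff, has_tasks):
    """Rule as stated in the reply: invalid only if empty *and* populated."""
    return bool(eff) or not has_tasks


online = set(range(8)) - {3}          # hotplug offlines CPU 3
sub1 = effective({0, 1}, online)
sub2 = effective({2, 3}, online)
part1 = effective({0, 1, 2, 3}, online, child_cpus=[{0, 1}, {2, 3}])

results = {
    "part1": is_valid(part1, has_tasks=False),   # no tasks of its own (assumed)
    "subpart1": is_valid(sub1, has_tasks=True),
    "subpart2": is_valid(sub2, has_tasks=True),
}
print(results)
```

Under these assumptions all three stay valid: subpart2 still has CPU 2, and part1 has no effective CPUs but also no tasks.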

> b) administrative change writes 0-2 into part1 cpus

That is actually not allowed because of the following code in 
validate_change():

static int validate_change(struct cpuset *cur, struct cpuset *trial)
{
     :
         /* Each of our child cpusets must be a subset of us */
         ret = -EBUSY;
         cpuset_for_each_child(c, css, cur)
                 if (!is_cpuset_subset(c, trial))
                         goto out;
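
As a rough Python model of the quoted check (CPU lists only; the kernel's is_cpuset_subset() is not reproduced here, and the function below is a hypothetical stand-in), writing 0-2 into part1 fails because subpart2 still lists CPU 3:

```python
import errno


def validate_change(trial_cpus, children_cpus):
    """Return 0 if every child's CPU list stays inside the trial list, else -EBUSY."""
    for child in children_cpus:
        if not child <= trial_cpus:
            return -errno.EBUSY
    return 0


children = [{0, 1}, {2, 3}]                          # subpart1 and subpart2
print(validate_change({0, 1, 2}, children))          # rejected: subpart2 needs CPU 3
print(validate_change({0, 1, 2, 3, 4}, children))    # accepted
```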

>    - would part1 still be considered a valid root?
>      - yes
>    - would subpart1 still be considered a valid root?
>      - yes
>    - would subpart2 still be considered a valid root?
>      - perhaps not
>
> c) administrative change writes 3-7 into `other` cpus
>    - should this fail or invalidate a root partition part1?
>      - perhaps fail since the same "owner" manages all siblings and
>        should reduce part1 first
Again, this will not be allowed because of the CPU_EXCLUSIVE flag set in 
part1.
>
> The answers above are just my "natural" responses, the ideal may be
> different. The issue I want to illustrate is that if all the conditions
> are formed as transition conditions only, they can't be used to reason
> about hotplug or config changes (except for cpuset.cpus.partition
> writes).
>
> What would help me with the understanding -- the invalid root partition is defined as
> 1) such a cgroup where no cpus are granted from the top (and thus has to fall back to ancestors)
> or
> 2) such a cgroup where cpus requested in cpuset.cpus can't be fulfilled (i.e. any missing invalidates)?
For a valid partition, "cpuset.cpus.effective" is always a subset of 
"cpuset.cpus". When "cpuset.cpus.effective" becomes empty and there are 
tasks in the partition, it becomes invalid and inherits the non-empty 
cpuset.cpus.effective of the nearest ancestor. The condition that causes 
"cpuset.cpus.effective" to become empty can be hotplug or changes to 
"cpuset.cpus".
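
The fallback to the nearest ancestor can be sketched as a simple walk up the hierarchy. The dictionary layout and the `effective_cpus` helper are assumptions made for this example, not kernel data structures.

```python
def effective_cpus(node):
    """Walk up until an ancestor with a non-empty effective CPU set is found."""
    while node is not None:
        if node["effective"]:
            return node["effective"]
        node = node["parent"]
    return set()


root = {"effective": {0, 1, 2, 3}, "parent": None}
part1 = {"effective": set(), "parent": root}
sub2 = {"effective": set(), "parent": part1}   # e.g. all of its CPUs went offline
print(effective_cpus(sub2))                    # {0, 1, 2, 3}
```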
> Furthermore, another example (motivated by the patch 4/6)
>
> r		// all cpus 0-7
> `- part1	cpus=0-4	root >partition
>     ` subpart1	cpus=0-1	root >partition
>     ` subpart2	cpus=2-3	root >partition
>     ` task
> `- other	cpus=5-7	// member by default
>
> It's a valid and achievable state (even on v2 since cpuset is a threaded
> controller).
>
> a) cpu 4 is offlined
>    - this should invalidate part1 (and propagate invalidation into
>      subpart1 and subpart2).

That is subject to design. My current thought is to keep part1 as valid 
but invalidate the child partitions (subpart1 and subpart2).


> b) administrative write 0-3 into part1 cpus
>    - should this invalidate part1 or be rejected?

The result should be the same as (a).

>
> In conclusion, it'd be good to have validity conditions separate from
> transition conditions (since hotplug transition can't be rejected) and
> perhaps treat administrative changes from an ancestor equally as a
> hotplug.

I am trying to make the result of changing "cpuset.cpus" as close to 
hotplug as possible but there are cases where the "cpuset.cpus" change 
is prohibited but hotplug can still happen to remove the cpu.

Hope this will help to clarify the current design.

Cheers,
Longman
Waiman Long Oct. 13, 2021, 10:11 p.m. UTC | #12
On 10/13/21 5:45 PM, Waiman Long wrote:
>
>
>>
>> In conclusion, it'd be good to have validity conditions separate from
>> transition conditions (since hotplug transition can't be rejected) and
>> perhaps treat administrative changes from an ancestor equally as a
>> hotplug.
>
> I am trying to make the result of changing "cpuset.cpus" as close to 
> hotplug as possible but there are cases where the "cpuset.cpus" change 
> is prohibited but hotplug can still happen to remove the cpu.
>
> Hope this will help to clarify the current design.
>
BTW, the attached file is the current draft of cpuset.cpus.partition 
document.

Cheers,
Longman
cpuset.cpus.partition
	A read-write single value file which exists on non-root
	cpuset-enabled cgroups.  This flag is owned by the parent cgroup
	and is not delegatable.

	It accepts only the following input values when written to.

	  ========	================================
	  "member"	Non-root member of a partition
	  "root"	Partition root
	  "isolated"	Partition root without load balancing
	  ========	================================

	When set to be a partition root, the current cgroup is the
	root of a new partition or scheduling domain that comprises
	itself and all its descendants except those that are separate
	partition roots themselves and their descendants.  The root
	cgroup is always a partition root.

	When set to "isolated", the CPUs in that partition root will
	be in an isolated state without any load balancing from the
	scheduler.  Tasks in such a partition must be explicitly bound
	to each individual CPU.
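
For illustration only, binding a task to a single CPU from user space could be done with Python's `os.sched_setaffinity` (available on Linux). `pin_to_cpu` is a hypothetical helper name, and which CPU to choose is left to the caller.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Bind the given task (0 = the caller) to exactly one CPU."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

The CPU passed in would normally be one of those listed in the isolated partition's "cpuset.cpus.effective".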

	"cpuset.cpus" must always be set up first before enabling
	partition.  Unlike "member" whose "cpuset.cpus.effective" can
	contain CPUs not in "cpuset.cpus", this can never happen with a
	valid partition root.  In other words, "cpuset.cpus.effective"
	is always a subset of "cpuset.cpus" for a valid partition root.

	When a parent partition root cannot exclusively grant any of
	the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective"
	becomes empty. If there are tasks in the partition root, the
	partition root becomes invalid and "cpuset.cpus.effective"
	is reset to that of the nearest non-empty ancestor.

	Note that a task cannot be moved to a cgroup with empty
	"cpuset.cpus.effective".

	There are additional constraints on where a partition root can
	be enabled ("root" or "isolated").  It can only be enabled in
	a cgroup if all the following conditions are met.

	1) The "cpuset.cpus" is non-empty and exclusive, i.e. its CPUs
	   are not shared by any of its siblings.
	2) The parent cgroup is a valid partition root.
	3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
	4) There are no child cgroups with cpuset enabled.  This avoids
	   cpu migrations of multiple cgroups simultaneously which can
	   be problematic.

	On read, the "cpuset.cpus.partition" file can show the following
	values.

	  ======================	==============================
	  "member"			Non-root member of a partition
	  "root"			Partition root
	  "isolated"			Partition root without load balancing
	  "root invalid (<reason>)"	Invalid partition root
	  ======================	==============================

        In the case of an invalid partition root, a descriptive string on
        why the partition is invalid is included within parentheses.
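
A user space agent has to tell the four read values apart. The following is a minimal sketch of such a parser, assuming the file content matches the table above exactly. The names `PartitionState` and `parse_partition_state` are hypothetical, and the reason text is treated as opaque because the draft does not enumerate the possible strings.

```python
import re
from typing import NamedTuple, Optional


class PartitionState(NamedTuple):
    state: str             # "member", "root", "isolated" or "root invalid"
    reason: Optional[str]  # text inside the parentheses, if present


# Assumption: an invalid root reads as "root invalid (<reason>)".  The
# parenthesized part is optional here so that a bare "root invalid" is
# also accepted.
_INVALID_RE = re.compile(r"^root invalid(?: \((?P<reason>.*)\))?$")


def parse_partition_state(text: str) -> PartitionState:
    """Parse the content read from cpuset.cpus.partition."""
    value = text.strip()
    if value in ("member", "root", "isolated"):
        return PartitionState(value, None)
    match = _INVALID_RE.match(value)
    if match:
        return PartitionState("root invalid", match.group("reason"))
    raise ValueError(f"unrecognized partition state: {value!r}")
```

An agent could call this on the file content after every change to "cpuset.cpus" and log the reason whenever the state is "root invalid".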

	Once a cgroup becomes a partition root, changes to "cpuset.cpus"
	are generally allowed as long as the cpu list is exclusive and
	is a superset of children's cpu lists.

        The constraints of a valid partition root are as follows:

        1) "cpuset.cpus" is non-empty and exclusive.
        2) The parent cgroup is a valid partition root.
        3) "cpuset.cpus.effective" is a subset of "cpuset.cpus".
        4) "cpuset.cpus.effective" is non-empty when there are tasks
           in the partition.
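
The four constraints above can be modelled with plain set operations. This is only an illustrative user space sketch, not the kernel logic: the helper names are hypothetical, CPU lists are assumed to use the usual "0-3,5" range syntax, and exclusivity is checked only against the sibling sets passed in.

```python
from typing import Iterable, List, Set


def parse_cpu_list(text: str) -> Set[int]:
    """Turn a cpu list such as "0-3,5" into a set of CPU numbers."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def violated_constraints(cpus: Set[int],
                         effective: Set[int],
                         sibling_cpus: Iterable[Set[int]],
                         parent_is_valid_root: bool,
                         has_tasks: bool) -> List[int]:
    """Return the numbers (1-4) of the constraints that do not hold."""
    violated = []
    # 1) "cpuset.cpus" is non-empty and exclusive.
    if not cpus or any(cpus & other for other in sibling_cpus):
        violated.append(1)
    # 2) The parent cgroup is a valid partition root.
    if not parent_is_valid_root:
        violated.append(2)
    # 3) "cpuset.cpus.effective" is a subset of "cpuset.cpus".
    if not effective <= cpus:
        violated.append(3)
    # 4) "cpuset.cpus.effective" is non-empty when there are tasks.
    if has_tasks and not effective:
        violated.append(4)
    return violated
```

For example, a partition with "cpuset.cpus" of "2-3" whose CPUs have all gone offline has an empty effective set; with tasks present, the sketch reports constraint 4 as violated.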

	Changes to "cpuset.cpus" or cpu hotplug may cause the state
	of a valid partition root to become invalid when one or more
	constraints of a valid partition root are violated.  Therefore,
	user space agents that manage partition roots should avoid
	unnecessary changes to "cpuset.cpus" and always check the state
	of "cpuset.cpus.partition" after making changes to make sure
	that the partitions are functioning properly as expected.

	Changing a partition root to "member" is always allowed.
	If there are child partition roots underneath it, however,
	they will be forced back to "member" too and lose their
	partitions.  So take care to check for this condition before
	disabling a partition root.

	Setting a cgroup to a valid partition root will take the CPUs
	away from the effective CPUs of the parent partition.

	A valid parent partition may distribute out all its CPUs to
	its child partitions as long as it is not the root cgroup as
	we need some house-keeping CPUs in the root cgroup.

	An invalid partition is not a real partition even though some
	internal states may still be kept.

	An invalid partition root can revert to a real partition
	root once none of the constraints of a valid partition root
	are violated.

	Poll and inotify events are triggered whenever the state of
	"cpuset.cpus.partition" changes.  That includes changes caused by
	writes to "cpuset.cpus.partition", cpu hotplug and other changes
	that make the partition invalid.  This will allow user space
	agents to monitor unexpected changes to "cpuset.cpus.partition"
	without the need to do continuous polling.
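
The poll-based monitoring described in the last paragraph could look roughly like the sketch below. It assumes that a state change is reported to pollers as POLLPRI (possibly together with POLLERR), as is usual for cgroup v2 event files; the draft itself does not say which poll bits are set. The helper names and the example path are hypothetical.

```python
import os
import select
from typing import Optional


def read_state(fd: int) -> str:
    """Re-read the whole file from the start and strip the newline."""
    os.lseek(fd, 0, os.SEEK_SET)
    return os.read(fd, 4096).decode().strip()


def wait_for_state_change(path: str,
                          timeout_ms: Optional[int] = None) -> Optional[str]:
    """Block until the content of `path` changes; return the new content.

    Returns None if `timeout_ms` expires first.  Assumption: the kernel
    signals a change with POLLPRI and/or POLLERR on the open file.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        last = read_state(fd)
        poller = select.poll()
        poller.register(fd, select.POLLPRI | select.POLLERR)
        while True:
            if not poller.poll(timeout_ms):
                return None  # timed out without any event
            current = read_state(fd)
            if current != last:
                return current
    finally:
        os.close(fd)
```

A caller might use `wait_for_state_change("/sys/fs/cgroup/mygroup/cpuset.cpus.partition")`, where the cgroup path is only an example, and react when the returned string starts with "root invalid".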
Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index babbe04c8d37..e759b0898bce 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2091,8 +2091,9 @@  Cpuset Interface Files
 	It accepts only the following input values when written to.
 
 	  ========	================================
-	  "root"	a partition root
-	  "member"	a non-root member of a partition
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
 	  ========	================================
 
 	When set to be a partition root, the current cgroup is the
@@ -2101,64 +2102,77 @@  Cpuset Interface Files
 	partition roots themselves and their descendants.  The root
 	cgroup is always a partition root.
 
-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler.  Tasks in such a partition must be explicitly bound
+	to each individual CPU.
+
+	There are constraints on where a partition root can be set
+	("root" or "isolated").  It can only be set in a cgroup if all
+	the following conditions are true.
 
 	1) The "cpuset.cpus" is not empty and the list of CPUs are
 	   exclusive, i.e. they are not shared by any of its siblings.
 	2) The parent cgroup is a partition root.
-	3) The "cpuset.cpus" is also a proper subset of the parent's
+	3) The "cpuset.cpus" is a subset of the parent's
 	   "cpuset.cpus.effective".
 	4) There is no child cgroups with cpuset enabled.  This is for
 	   eliminating corner cases that have to be handled if such a
 	   condition is allowed.
 
-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup.  Once it is set, this
-	file cannot be reverted back to "member" if there are any child
+	Setting it to a partition root will take the CPUs away from the
+	effective CPUs of the parent cgroup.  Once it is set, this file
+	should not be reverted back to "member" if there are any child
 	cgroups with cpuset enabled.
 
-	A parent partition cannot distribute all its CPUs to its
-	child partitions.  There must be at least one cpu left in the
-	parent partition.
-
-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
-
-	Sometimes, external factors like changes to ancestors'
-	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change.  On read, the "cpuset.sched.partition" file
-	can show the following values.
-
-	  ==============	==============================
-	  "member"		Non-root member of a partition
-	  "root"		Partition root
-	  "root invalid"	Invalid partition root
-	  ==============	==============================
-
-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
-
-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself.  In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
-	The cpu affinity of all the tasks in the cgroup will then be
-	associated with CPUs in the nearest ancestor partition.
-
-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent.  In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	A parent partition may distribute all its CPUs to its child
+	partitions as long as it is not the root cgroup.
+
+	Once becoming a partition root, changes to "cpuset.cpus"
+	is generally allowed as long as the first condition above
+	(cpu exclusivity rule) is true.
+
+	Sometimes, changes to "cpuset.cpus" or cpu hotplug may cause
+	the state of the partition root to become invalid when the
+	other constraints of partition root are violated.  Therefore,
+	user space agents that manage partition roots should avoid
+	unnecessary changes to "cpuset.cpus" and monitor the state of
+	"cpuset.cpus.partition" to make sure that the partitions are
+	functioning as expected.
+
+	On read, the "cpuset.cpus.partition" file can show the following
+	values.
+
+	  ======================	==============================
+	  "member"			Non-root member of a partition
+	  "root"			Partition root
+	  "isolated"			Partition root without load balancing
+	  "root invalid (<reason>)"	Invalid partition root
+	  ======================	==============================
+
+	A partition root becomes invalid if all the CPUs requested in
+	"cpuset.cpus" become unavailable.  This can happen if all the
+	CPUs have been offlined, or the state of an ancestor partition
+	root become invalid. "<reason>" is a string that describes why
+	the partition becomes invalid.
+
+	An invalid partition is not a real partition even though some
+	internal states may still be kept.  The cpu affinity of all
+	the tasks in the cgroup will then be associated with CPUs in
+	the nearest ancestor partition.
+
+	An invalid partition root can be reverted back to a real
+	partition root if at least one of the requested CPUs become
+	available again.  In this case, the cpu affinity of all the
+	tasks in the formerly invalid partition will be associated to
+	the CPUs of the newly formed partition.
+
+	Poll and inotify events are triggered whenever the state of
+	"cpuset.cpus.partition" changes.  That includes changes caused by
+	write to "cpuset.cpus.partition", cpu hotplug and other changes
+	that make the partition invalid.  This will allow user space
+	agents to monitor unexpected changes to "cpuset.cpus.partition"
+	without the need to do continuous polling.
 
 
 Device controller