Message ID | 20211018143619.205065-6-longman@redhat.com (mailing list archive)
State      | New
Series     | cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus
Hello. On Mon, Oct 18, 2021 at 10:36:18AM -0400, Waiman Long <longman@redhat.com> wrote: > + When set to "isolated", the CPUs in that partition root will > + be in an isolated state without any load balancing from the > + scheduler. Tasks in such a partition must be explicitly bound > + to each individual CPU. This sounds reasonable but it seems to have some usability issues as was raised in another thread [1]. (I could only think of the workaround of single-cpu cgroup leaves + CLONE_INTO_CGROUP.) TL;DR Do whatever you find suitable but (re)consider sticking to the delegation principle (making hotplug and ancestor changes equal). Now to the constraints and partition setups. I think it's useful to have a model with which the implementation can be compared with. I tried to condense some "simple rules" from the descriptions you posted in v8 plus your response to my remarks in v7 [2]. These should only be the "validity conditions", not "transition conditions". ## Validity conditions For simplification, there's a condition called 'degraded' that tells whether a cpuset can host tasks (with the given config) that expands to two predicates: degraded := cpus.internal_effective == ø && has_tasks valid_root := !degraded && cpus_exclusive && parent.valid_root (valid_member := !degraded) with a helping predicate cpus_exclusive := cpus not shared by a sibling The effective CPUs basically combine configured+available CPUs cpus.internal_effective := (cpus ∩ parent.cpus ∩ online_cpus) - passed where passed := union of children cpus whose partition is not member Finally, to handle the degraded cpusets gracefully, we define if (!degraded) cpus.effective := cpus.internal_effective else cpus.effective := parent.cpus.effective (In cases when there's no parent, we replace its cpus with online_cpus.) --- I'll try applying these conditions to your description. > + > + "cpuset.cpus" must always be set up first before enabling > + partition. This is just a transition condition. > Unlike "member" whose "cpuset.cpus.effective" can > + contain CPUs not in "cpuset.cpus", this can never happen with a > + valid partition root. In other words, "cpuset.cpus.effective" > + is always a subset of "cpuset.cpus" for a valid partition root. IIUC this refers to the cgroup that is 'degraded'. (The consequences for a valid partition root follow from valid_root definition above.) > + > + When a parent partition root cannot exclusively grant any of > + the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective" > + becomes empty. This sounds too strict to me, perhaps you meant 'cannot grant _all_ of the CPUs'? > If there are tasks in the partition root, the > + partition root becomes invalid and "cpuset.cpus.effective" > + is reset to that of the nearest non-empty ancestor. This is captured in the definition of 'degraded'. > + > + Note that a task cannot be moved to a croup with empty > + "cpuset.cpus.effective". A transition condition. (Makes sense.) [With the validity conditions above, it's possible to have 'valid_root' with empty cpus (hence also empty cpus.internal_effective) if there are no tasks in there. The transition conditions so far prevented this corner case.] > + There are additional constraints on where a partition root can > + be enabled ("root" or "isolated"). It can only be enabled in > + a cgroup if all the following conditions are met. 
I think the enablement (aka rewriting cpuset.cpus.partition) could always be possible but it'd result in "root invalid (...)" if the resulting config doesn't meet the validity condition. > + > + 1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are > + not shared by any of its siblings. The emptiness here is a judgement call (in my formulation of the conditions it seemed simpler to allow empty cpus.internal_effective with no tasks). > + 2) The parent cgroup is a valid partition root. Captured in the valid_root definition. > + 3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus". This is unnecessary strictness. Allow such config, cpus.internal_effective still can't be more than parent's cpuset.cpus. (Or do you have a reason to discard such configs?) > + 4) There is no child cgroups with cpuset enabled. This avoids > + cpu migrations of multiple cgroups simultaneously which can > + be problematic. A transition condition (i.e. not relevant to validity conditions). > + Once becoming a partition root, changes to "cpuset.cpus" > + is generally allowed as long as the cpu list is exclusive, > + non-empty and is a superset of children's cpu lists. Any changes should be allowed, otherwise it denies the delegation principle of v2 (IOW a parent should be able to preempt CPUs given to children previously and not be denied because of them). (If the change results in a failed validity condition, the cgroup of course cannot be a valid_root anymore.) > + The constraints of a valid partition root are as follows: > + > + 1) The parent cgroup is a valid partition root. > + 2) "cpuset.cpus.effective" is a subset of "cpuset.cpus" > + 3) "cpuset.cpus.effective" is non-empty when there are tasks > + in the partition. (This seems to miss the sibling exclusivity condition.) Here I'd simply paste the "Validity conditions" specified above instead. > + Changing a partition root to "member" is always allowed. > + If there are child partition roots underneath it, however, > + they will be forced to be switched back to "member" too and > + lose their partitions. So care must be taken to double check > + for this condition before disabling a partition root. (Or is this how delegation is intended?) However, AFAICS, parent still can't remove cpuset.cpus even when the child is a "member". Otherwise, I agree with the back-switch. > + Setting a cgroup to a valid partition root will take the CPUs > + away from the effective CPUs of the parent partition. Captured in the definition of cpus.internal_effective. > + A valid parent partition may distribute out all its CPUs to > + its child partitions as long as it is not the root cgroup as > + we need some house-keeping CPUs in the root cgroup. This actually applies to any root partition that's supposed to host tasks. (IOW, 'valid_root' cannot be 'degraded'.) > + An invalid partition is not a real partition even though some > + internal states may still be kept. Tautology? (Or new definition of "real".) > + > + An invalid partition root can be reverted back to a real > + partition root if none of the constraints of a valid partition > + root are violated. Yes. (Also tautological.) Anyway, as I said above, I just tried to formulate the model for clearer understanding and the implementation may introduce transition constraints but it'd be good to always have the simple rules to tell what's a valid root in the tree and what's not.
Regards, Michal [1] https://lore.kernel.org/r/AM9PR10MB4869C14EAE01B87C0037BF6A89939@AM9PR10MB4869.EURPRD10.PROD.OUTLOOK.COM/ [2] https://lore.kernel.org/lkml/5eacfdcc-148b-b599-3111-4f2971e7ddc0@redhat.com/
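To make the validity conditions above easier to follow, here is a small illustrative sketch that encodes the same predicates. It is plain Python, not kernel code; the Cpuset class, the ONLINE_CPUS set and the field names are invented for this example.

from dataclasses import dataclass, field
from typing import List, Optional, Set

ONLINE_CPUS: Set[int] = set(range(8))        # assumed set of online CPUs

@dataclass
class Cpuset:
    cpus: Set[int]                           # configured "cpuset.cpus"
    partition: str = "member"                # "member", "root" or "isolated"
    has_tasks: bool = False
    parent: Optional["Cpuset"] = None
    children: List["Cpuset"] = field(default_factory=list)

def parent_cpus(cs: Cpuset) -> Set[int]:
    # When there is no parent, substitute online_cpus as in the model above.
    return cs.parent.cpus if cs.parent else ONLINE_CPUS

def passed(cs: Cpuset) -> Set[int]:
    # passed := union of children's cpus whose partition is not "member"
    out: Set[int] = set()
    for child in cs.children:
        if child.partition != "member":
            out |= child.cpus
    return out

def internal_effective(cs: Cpuset) -> Set[int]:
    # cpus.internal_effective := (cpus ∩ parent.cpus ∩ online_cpus) - passed
    return (cs.cpus & parent_cpus(cs) & ONLINE_CPUS) - passed(cs)

def degraded(cs: Cpuset) -> bool:
    # degraded := cpus.internal_effective == ø && has_tasks
    return not internal_effective(cs) and cs.has_tasks

def cpus_exclusive(cs: Cpuset) -> bool:
    # cpus not shared by a sibling
    siblings = [c for c in cs.parent.children if c is not cs] if cs.parent else []
    return all(not (cs.cpus & s.cpus) for s in siblings)

def valid_root(cs: Cpuset) -> bool:
    if cs.parent is None:
        return True                          # the root cgroup is always a partition root
    return not degraded(cs) and cpus_exclusive(cs) and valid_root(cs.parent)

def effective(cs: Cpuset) -> Set[int]:
    # Degraded cpusets fall back to the parent's effective cpus.
    if not degraded(cs):
        return internal_effective(cs)
    return effective(cs.parent) if cs.parent else ONLINE_CPUS

With such a toy model one can build a hierarchy (for instance the app_root/non_rt/rt layout that comes up later in the thread), evaluate valid_root() and effective() for each node, and compare the result against what the kernel reports.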
Hello, On Mon, Nov 15, 2021 at 08:31:22PM +0100, Michal Koutný wrote: > Now to the constraints and partition setups. I think it's useful to have > a model with which the implementation can be compared with. > I tried to condense some "simple rules" from the descriptions you posted > in v8 plus your response to my remarks in v7 [2]. These should only be > the "validity conditions", not "transition conditions". FWIW, my opinion is pretty much in line with Michal's in this regard. Other than that, everything looks pretty good to me. Thanks.
On 11/15/21 14:31, Michal Koutný wrote: > Hello. > > On Mon, Oct 18, 2021 at 10:36:18AM -0400, Waiman Long <longman@redhat.com> wrote: >> + When set to "isolated", the CPUs in that partition root will >> + be in an isolated state without any load balancing from the >> + scheduler. Tasks in such a partition must be explicitly bound >> + to each individual CPU. > This sounds reasonable but it seems to have some usability issues as was > raised in another thread [1]. (I could only think of the workaround of > single-cpu cgroup leaves + CLONE_INTO_CGROUP.) It can be a problem when one is trying to move from one cgroup to another cgroup with non-overlapping cpus laterally. However, if a task is initially from a parent cgroup with affinity mask that include cpus in the isolated child cgroup, I believe it should be able to move to the isolated child cgroup without problem. Otherwise, it is a bug that needs to be fixed. > > TL;DR Do whatever you find suitable but (re)consider sticking to the > delegation principle (making hotplug and ancestor changes equal). > > Now to the constraints and partition setups. I think it's useful to have > a model with which the implementation can be compared with. > I tried to condense some "simple rules" from the descriptions you posted > in v8 plus your response to my remarks in v7 [2]. These should only be > the "validity conditions", not "transition conditions". > > ## Validity conditions > > For simplification, there's a condition called 'degraded' that tells > whether a cpuset can host tasks (with the given config) that expands to > two predicates: > > degraded := cpus.internal_effective == ø && has_tasks > valid_root := !degraded && cpus_exclusive && parent.valid_root > (valid_member := !degraded) > > with a helping predicate > cpus_exclusive := cpus not shared by a sibling > > The effective CPUs basically combine configured+available CPUs > > cpus.internal_effective := (cpus ∩ parent.cpus ∩ online_cpus) - passed > > where > passed := union of children cpus whose partition is not member > > Finally, to handle the degraded cpusets gracefully, we define > > if (!degraded) > cpus.effective := cpus.internal_effective > else > cpus.effective := parent.cpus.effective > > (In cases when there's no parent, we replace its cpus with online_cpus.) > > --- > > I'll try applying these conditions to your description. > >> + >> + "cpuset.cpus" must always be set up first before enabling >> + partition. > This is just a transition condition. > >> Unlike "member" whose "cpuset.cpus.effective" can >> + contain CPUs not in "cpuset.cpus", this can never happen with a >> + valid partition root. In other words, "cpuset.cpus.effective" >> + is always a subset of "cpuset.cpus" for a valid partition root. > IIUC this refers to the cgroup that is 'degraded'. (The consequences for > a valid partition root follow from valid_root definition above.) > >> + >> + When a parent partition root cannot exclusively grant any of >> + the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective" >> + becomes empty. > This sounds too strict to me, perhaps you meant 'cannot grant _all_ of > the CPUs'? I think the wording may be confusing. What I meant is none of the requested cpu can be granted. So if there is at least one granted, the effective cpus won't be empty. >> If there are tasks in the partition root, the >> + partition root becomes invalid and "cpuset.cpus.effective" >> + is reset to that of the nearest non-empty ancestor. > This is captured in the definition of 'degraded'. 
> >> + >> + Note that a task cannot be moved to a cgroup with empty >> + "cpuset.cpus.effective". > A transition condition. (Makes sense.) > > [With the validity conditions above, it's possible to have 'valid_root' > with empty cpus (hence also empty cpus.internal_effective) if there are > no tasks in there. The transition conditions so far prevented this > corner case.] > >> + There are additional constraints on where a partition root can >> + be enabled ("root" or "isolated"). It can only be enabled in >> + a cgroup if all the following conditions are met. > I think the enablement (aka rewriting cpuset.cpus.partition) could > always be possible but it'd result in "root invalid (...)" if the resulting > config doesn't meet the validity condition. > >> + >> + 1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are >> + not shared by any of its siblings. > The emptiness here is a judgement call (in my formulation of the > conditions it seemed simpler to allow empty cpus.internal_effective with > no tasks). There are more constraints in enabling a partition. Once it is enabled, there will be fewer constraints to maintain its validity. > >> + 2) The parent cgroup is a valid partition root. > Captured in the valid_root definition. > >> + 3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus". > This is unnecessary strictness. Allow such config, > cpus.internal_effective still can't be more than parent's cpuset.cpus. > (Or do you have a reason to discard such configs?) > >> + 4) There is no child cgroups with cpuset enabled. This avoids >> + cpu migrations of multiple cgroups simultaneously which can >> + be problematic. > A transition condition (i.e. not relevant to validity conditions). > >> + Once becoming a partition root, changes to "cpuset.cpus" >> + is generally allowed as long as the cpu list is exclusive, >> + non-empty and is a superset of children's cpu lists. > Any changes should be allowed, otherwise it denies the delegation > principle of v2 (IOW a parent should be able to preempt CPUs given to > children previously and not be denied because of them). > > (If the change results in a failed validity condition, the cgroup of course > cannot be a valid_root anymore.) > >> + The constraints of a valid partition root are as follows: >> + >> + 1) The parent cgroup is a valid partition root. >> + 2) "cpuset.cpus.effective" is a subset of "cpuset.cpus" >> + 3) "cpuset.cpus.effective" is non-empty when there are tasks >> + in the partition. > (This seems to miss the sibling exclusivity condition.) > Here I'd simply paste the "Validity conditions" specified above instead. You currently cannot make a change to cpuset.cpus that violates the cpu exclusivity rule. The above constraints will not disallow you to make the change. They just affect the validity of the partition root. > >> + Changing a partition root to "member" is always allowed. >> + If there are child partition roots underneath it, however, >> + they will be forced to be switched back to "member" too and >> + lose their partitions. So care must be taken to double check >> + for this condition before disabling a partition root. > (Or is this how delegation is intended?) However, AFAICS, parent still > can't remove cpuset.cpus even when the child is a "member". Otherwise, > I agree with the back-switch. There are only 2 possibilities here. Either we force the child partitions to become members or invalid partition roots. 
An invalid partition root is really meant to be a transient state which can be recovered from in some way to make it a valid partition again. However, changing a parent partition root to member breaks the possibility of recovering later. That is why I think it is more sensible to force those child partitions to become members. > > >> + Setting a cgroup to a valid partition root will take the CPUs >> + away from the effective CPUs of the parent partition. > Captured in the definition of cpus.internal_effective. > >> + A valid parent partition may distribute out all its CPUs to >> + its child partitions as long as it is not the root cgroup as >> + we need some house-keeping CPUs in the root cgroup. > This actually applies to any root partition that's supposed to host > tasks. (IOW, 'valid_root' cannot be 'degraded'.) > >> + An invalid partition is not a real partition even though some >> + internal states may still be kept. > Tautology? (Or new definition of "real".) > >> + >> + An invalid partition root can be reverted back to a real >> + partition root if none of the constraints of a valid partition >> + root are violated. > Yes. (Also tautological.) > > Anyway, as I said above, I just tried to formulate the model for clearer > understanding and the implementation may introduce transition > constraints but it'd be good to always have the simple rules to tell > what's a valid root in the tree and what's not. Thanks for analyzing each statement for its validity. I will try to improve it to make it easier to understand. Cheers, Longman
On 11/15/21 15:11, Tejun Heo wrote: > Hello, > > On Mon, Nov 15, 2021 at 08:31:22PM +0100, Michal Koutný wrote: >> Now to the constraints and partition setups. I think it's useful to have >> a model with which the implementation can be compared with. >> I tried to condense some "simple rules" from the descriptions you posted >> in v8 plus your response to my remarks in v7 [2]. These should only be >> the "validity conditions", not "transition conditions". > FWIW, my opinion is pretty much in line with Michal's in this regard. Other > than that, everything looks pretty good to me. Yes, I am going to streamline the documentation as suggested to make it easier to understand. Coding-wise, do you have other changes you want me to make? Thanks, Longman
On Mon, Nov 15, 2021 at 04:10:29PM -0500, Waiman Long <longman@redhat.com> wrote: > > On Mon, Oct 18, 2021 at 10:36:18AM -0400, Waiman Long <longman@redhat.com> wrote: > > > + scheduler. Tasks in such a partition must be explicitly bound > > > + to each individual CPU. > [...] > > It can be a problem when one is trying to move from one cgroup to another > cgroup with non-overlapping cpus laterally. However, if a task is initially > from a parent cgroup with affinity mask that include cpus in the isolated > child cgroup, I believe it should be able to move to the isolated child > cgroup without problem. Otherwise, it is a bug that needs to be fixed. app_root cpuset.cpus=0-3 `- non_rt cpuset.cpus=0-1 cpuset.cpus.partition=member `- rt cpuset.cpus=2-3 cpuset.cpus.partition=isolated The app_root would have cpuset.cpus.effective=0-1 so even the task in app_root can't sched_setaffinity() to cpus 2-3. But AFAICS, the migration calls set_cpus_allowed_ptr() anyway, so the task in the isolated partition needn't to bind explicitly with sched_setaffinity(). (It'd have two cpus available, so one more sched_setaffinity() or migration into a single-cpu list is desirable.) All in all, I think the behavior is OK and the explicit binding of tasks in an isolated cpuset is optional (not a must as worded currently). > I think the wording may be confusing. What I meant is none of the requested > cpu can be granted. So if there is at least one granted, the effective cpus > won't be empty. Ack. > You currently cannot make change to cpuset.cpus that violates the cpu > exclusivity rule. The above constraints will not disallow you to make the > change. They just affect the validity of the partition root. Sibling exclusivity should be a validity condition regardless of whether transition is allowed or not. (At least it looks simpler to me.) > > > + Changing a partition root to "member" is always allowed. > > > + If there are child partition roots underneath it, however, > > > + they will be forced to be switched back to "member" too and > > > + lose their partitions. So care must be taken to double check > > > + for this condition before disabling a partition root. > > (Or is this how delegation is intended?) However, AFAICS, parent still > > can't remove cpuset.cpus even when the child is a "member". Otherwise, > > I agree with the back-switch. > There are only 2 possibilities here. Either we force the child partitions to > be become members or invalid partition root. My point here was mostly about preempting the cpus (as a v2 specific feature). (I'm rather indifferent whether children turn into invalid roots or members.) Thanks, Michal
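For illustration, the migrate-then-optionally-pin sequence discussed above could look roughly like the following sketch; the cgroup path and the PID are assumptions, and it presumes the "rt" cgroup is already a valid isolated partition with cpuset.cpus=2-3.

import os

RT_CG = "/sys/fs/cgroup/app_root/rt"    # assumed isolated partition, cpuset.cpus=2-3
pid = 1234                              # hypothetical task to place into the partition

# The migration itself updates the task's allowed CPUs to the partition's
# effective CPUs (the kernel calls set_cpus_allowed_ptr() on the way in),
# so no sched_setaffinity() is strictly required before or after the move.
with open(os.path.join(RT_CG, "cgroup.procs"), "w") as f:
    f.write(str(pid))

with open(os.path.join(RT_CG, "cpuset.cpus.effective")) as f:
    print("partition CPUs:", f.read().strip())       # e.g. "2-3"

# Optional: since load balancing is off in an isolated partition, pin the
# task to a single CPU of the partition for deterministic placement.
os.sched_setaffinity(pid, {2})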
On 11/16/21 12:54, Michal Koutný wrote: > On Mon, Nov 15, 2021 at 04:10:29PM -0500, Waiman Long <longman@redhat.com> wrote: >>> On Mon, Oct 18, 2021 at 10:36:18AM -0400, Waiman Long <longman@redhat.com> wrote: >>>> + scheduler. Tasks in such a partition must be explicitly bound >>>> + to each individual CPU. >>>> [...] >>>> >>>> It can be a problem when one is trying to move from one cgroup to another >>>> cgroup with non-overlapping cpus laterally. However, if a task is initially >>>> from a parent cgroup with affinity mask that include cpus in the isolated >>>> child cgroup, I believe it should be able to move to the isolated child >>>> cgroup without problem. Otherwise, it is a bug that needs to be fixed. > app_root cpuset.cpus=0-3 > `- non_rt cpuset.cpus=0-1 cpuset.cpus.partition=member > `- rt cpuset.cpus=2-3 cpuset.cpus.partition=isolated > > The app_root would have cpuset.cpus.effective=0-1 so even the task in > app_root can't sched_setaffinity() to cpus 2-3. > But AFAICS, the migration calls set_cpus_allowed_ptr() anyway, so the > task in the isolated partition needn't to bind explicitly with > sched_setaffinity(). (It'd have two cpus available, so one more > sched_setaffinity() or migration into a single-cpu list is desirable.) > > All in all, I think the behavior is OK and the explicit binding of tasks > in an isolated cpuset is optional (not a must as worded currently). > > >> I think the wording may be confusing. What I meant is none of the requested >> cpu can be granted. So if there is at least one granted, the effective cpus >> won't be empty. > Ack. > >> You currently cannot make change to cpuset.cpus that violates the cpu >> exclusivity rule. The above constraints will not disallow you to make the >> change. They just affect the validity of the partition root. > Sibling exclusivity should be a validity condition regardless of whether > transition is allowed or not. (At least it looks simpler to me.) > > >>>> + Changing a partition root to "member" is always allowed. >>>> + If there are child partition roots underneath it, however, >>>> + they will be forced to be switched back to "member" too and >>>> + lose their partitions. So care must be taken to double check >>>> + for this condition before disabling a partition root. >>> (Or is this how delegation is intended?) However, AFAICS, parent still >>> can't remove cpuset.cpus even when the child is a "member". Otherwise, >>> I agree with the back-switch. >> There are only 2 possibilities here. Either we force the child partitions to >> be become members or invalid partition root. > My point here was mostly about preempting the cpus (as a v2 specific > feature). (I'm rather indifferent whether children turn into invalid > roots or members.) Below is my latest iterations of the cpuset.cpus.partition documentation. If there is no objection or other suggestion for improvement, I am going to send out another iteration of the patch series with the updated documentation. Cheers, Longman -------------------------------------------------------------- cpuset.cpus.partition A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable. It accepts only the following input values when written to. 
======== ================================ "member" Non-root member of a partition "root" Partition root "isolated" Partition root without load balancing ======== ================================ The root cgroup is always a partition root and its state cannot be changed. All other non-root cgroups start out as "member". When set to "root", the current cgroup is the root of a new partition or scheduling domain that comprises itself and all its descendants except those that are separate partition roots themselves and their descendants. The value shown in "cpuset.cpus.effective" of a partition root is the CPUs that the parent partition root can dedicate to the new partition root. They are subtracted from "cpuset.cpus.effective" of the parent and may be different from "cpuset.cpus". When set to "isolated", the CPUs in that partition root will be in an isolated state without any load balancing from the scheduler. Tasks placed in such a partition with multiple CPUs should be carefully distributed and bound to each of the individual CPUs for optimal performance. A partition root ("root" or "isolated") can be in one of the two possible states - valid or invalid. An invalid partition root is in a degraded state where some state information is retained, but behaves more like a "member". On read, the "cpuset.cpus.partition" file can show the following values. ====================== ============================== "member" Non-root member of a partition "root" Partition root "isolated" Partition root without load balancing "root invalid (<reason>)" Invalid partition root ====================== ============================== In the case of an invalid partition root, a descriptive string on why the partition is invalid is included within parentheses. Almost all possible state transitions among "member", valid and invalid partition roots are allowed except from "member" to invalid partition root. Before the "member" to partition root transition can happen, the following conditions must be met or the transition will not be allowed. 1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are not shared by any of its siblings. 2) The parent cgroup is a valid partition root. 3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus". 4) There are no child cgroups with cpuset enabled. This avoids cpu migrations of multiple cgroups simultaneously which can be problematic. Once becoming a partition root, the following two rules restrict what changes can be made to "cpuset.cpus". 1) The value must be exclusive. 2) If child cpusets exist, the value must be a superset of what is defined in the child cpusets. The second rule applies even for "member". Other changes to "cpuset.cpus" that do not violate the above rules are always allowed. External events like hotplug or inappropriate changes to "cpuset.cpus" can cause a valid partition root to become invalid. Besides the constraints on changing "cpuset.cpus" listed above, the other conditions required to maintain the validity of a partition root are as follows: 1) The parent cgroup is a valid partition root. 2) If "cpuset.cpus.effective" is empty, the partition must have no task associated with it. Otherwise, the partition becomes invalid and "cpuset.cpus.effective" will fall back to that of the nearest non-empty ancestor. A corollary of a valid partition root is that "cpuset.cpus.effective" is always a subset of "cpuset.cpus". Note that a task cannot be moved to a cgroup with empty "cpuset.cpus.effective". 
Changing a partition root (valid or invalid) to "member" is always allowed. If there are child partition roots underneath it, however, they will be forced to be switched back to "member" too and lose their partitions. So care must be taken to double check for this condition before disabling a partition root. A valid parent partition may distribute out all its CPUs to its child partitions as long as it is not the root cgroup and there is no task associated with it. An invalid partition root can be reverted back to a valid one if none of the validity constraints of a valid partition root are violated. Poll and inotify events are triggered whenever the state of "cpuset.cpus.partition" changes. That includes changes caused by write to "cpuset.cpus.partition", cpu hotplug and other changes that make the partition invalid. This will allow user space agents to monitor unexpected changes to "cpuset.cpus.partition" without the need to do continuous polling.
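As a usage sketch of the transition rules in the draft above, assuming a cgroup v2 mount at /sys/fs/cgroup, an available cpuset controller and a hypothetical child cgroup named "rtpart":

import os

ROOT = "/sys/fs/cgroup"                 # assumed cgroup v2 mount point
PART = os.path.join(ROOT, "rtpart")     # hypothetical partition cgroup

def cg_write(cg, fname, value):
    with open(os.path.join(cg, fname), "w") as f:
        f.write(value)

# Make the cpuset interface files available to children of the root.
cg_write(ROOT, "cgroup.subtree_control", "+cpuset")
os.makedirs(PART, exist_ok=True)

# Per the draft: set up an exclusive, non-empty "cpuset.cpus" first, while
# the cgroup is still a "member" and has no cpuset-enabled children ...
cg_write(PART, "cpuset.cpus", "2-3")

# ... then flip the partition type ("root", or "isolated" to also turn off
# load balancing).
cg_write(PART, "cpuset.cpus.partition", "isolated")

# Check the outcome: an unmet condition is reported either as a write error
# or, after later events such as hotplug, as an "invalid" state with a reason.
with open(os.path.join(PART, "cpuset.cpus.partition")) as f:
    print(f.read().strip())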
Hello, Waiman. On Tue, Nov 30, 2021 at 10:35:19AM -0500, Waiman Long wrote: > On read, the "cpuset.cpus.partition" file can show the following > values. > > ====================== ============================== > "member" Non-root member of a partition > "root" Partition root > "isolated" Partition root without load balancing > "root invalid (<reason>)" Invalid partition root > ====================== ============================== What happens if an isolated domain becomes invalid and then valid again due to cpu hotplug? Does it go "root invalid" and then back to "isolated"? ... > Before the "member" to partition root transition can happen, > the following conditions must be met or the transition will > not be allowed. > > 1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are > not shared by any of its siblings. > 2) The parent cgroup is a valid partition root. > 3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus". > 4) There is no child cgroups with cpuset enabled. This avoids > cpu migrations of multiple cgroups simultaneously which can > be problematic. So, I still have a hard time justifying the above restrictions. 1) can be broken through hotplug anyway. 2) can be broken by the parent switching to member. 3) would mean that we'd need to restrict parent's config changes depending on what children are doing. 4) is more understandable but it's an implementation detail that we can address in the future. > Once becoming a partition root, the following two rules restrict > what changes can be made to "cpuset.cpus". > > 1) The value must be exclusive. > 2) If child cpusets exist, the value must be a superset of what > are defined in the child cpusets. > > The second rule applies even for "member". Other changes to > "cpuset.cpus" that do not violate the above rules are always > allowed. While it isn't necessarily tied to this series, it's a big no-no to restrict what a parent can do depending on what its descendants are doing. A cgroup higher up in the hierarchy should be able to change configuration however it sees fit as deligation breaks down otherwise. Maybe you can argue that cpuset is special and shouldn't be subject to such convention but I can't see strong enough justifications especially given that most of these restrictions can be broken by hotplug operations anyway and thus need code to handle those situations. > Changing a partition root (valid or invalid) to "member" is > always allowed. If there are child partition roots underneath > it, however, they will be forced to be switched back to "member" > too and lose their partitions. So care must be taken to double > check for this condition before disabling a partition root. Wouldn't it make more sense for them to retain their configuration and turn invalid? Why is this special? > A valid parent partition may distribute out all its CPUs to > its child partitions as long as it is not the root cgroup and > there is no task associated with it. A valid parent partition which isn't root never has tasks in them to begin with. > An invalid partition root can be reverted back to a valid one > if none of the validity constraints of a valid partition root > are violated. > > Poll and inotify events are triggered whenever the state of > "cpuset.cpus.partition" changes. That includes changes caused by > write to "cpuset.cpus.partition", cpu hotplug and other changes > that make the partition invalid. 
This will allow user space > agents to monitor unexpected changes to "cpuset.cpus.partition" > without the need to do continuous polling. Unfortunately, my sense is still that both the restrictions and behaviors are pretty arbitrary. I can somewhat see how the restrictions may make sense in a specific frame of mind but am having a hard time finding strong enough justifications for them. There are many really specific rules and it isn't clear why they are the way they are. Thanks.
On 11/30/21 12:11, Tejun Heo wrote: > Hello, Waiman. > > On Tue, Nov 30, 2021 at 10:35:19AM -0500, Waiman Long wrote: >> On read, the "cpuset.cpus.partition" file can show the following >> values. >> >> ====================== ============================== >> "member" Non-root member of a partition >> "root" Partition root >> "isolated" Partition root without load balancing >> "root invalid (<reason>)" Invalid partition root >> ====================== ============================== > What happens if an isolated domain becomes invalid and then valid again due > to cpu hotplug? Does it go "root invalid" and then back to "isolated"? Yes, the current code allow recovering from an invalid state. In this particular case, the transition will be "isolated" --> "root invalid" --> "isolated". > ... >> Before the "member" to partition root transition can happen, >> the following conditions must be met or the transition will >> not be allowed. >> >> 1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are >> not shared by any of its siblings. >> 2) The parent cgroup is a valid partition root. >> 3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus". >> 4) There is no child cgroups with cpuset enabled. This avoids >> cpu migrations of multiple cgroups simultaneously which can >> be problematic. > So, I still have a hard time justifying the above restrictions. 1) can be > broken through hotplug anyway. 2) can be broken by the parent switching to > member. 3) would mean that we'd need to restrict parent's config changes > depending on what children are doing. 4) is more understandable but it's an > implementation detail that we can address in the future. > The initial transition to a partition root has a higher barrier. Once it becomes a partition root. Some restrictions are relaxed. >> Once becoming a partition root, the following two rules restrict >> what changes can be made to "cpuset.cpus". >> >> 1) The value must be exclusive. >> 2) If child cpusets exist, the value must be a superset of what >> are defined in the child cpusets. >> >> The second rule applies even for "member". Other changes to >> "cpuset.cpus" that do not violate the above rules are always >> allowed. > While it isn't necessarily tied to this series, it's a big no-no to restrict > what a parent can do depending on what its descendants are doing. A cgroup > higher up in the hierarchy should be able to change configuration however it > sees fit as deligation breaks down otherwise. > > Maybe you can argue that cpuset is special and shouldn't be subject to such > convention but I can't see strong enough justifications especially given > that most of these restrictions can be broken by hotplug operations anyway > and thus need code to handle those situations. These are all pre-existing restrictions before the introduction of partition. These are checks done in validate_change(). I am just saying out loud the existing behavior. If you think that needs to be changed, I am fine with that. However, it will be a separate patch as it is not a behavior that is introduced by this series. >> Changing a partition root (valid or invalid) to "member" is >> always allowed. If there are child partition roots underneath >> it, however, they will be forced to be switched back to "member" >> too and lose their partitions. So care must be taken to double >> check for this condition before disabling a partition root. > Wouldn't it make more sense for them to retain their configuration and turn > invalid? Why is this special? 
Once an invalid partition is changed to "member", there is no way for a child invalid partition root to recover and become valid again. That is why I force them to become "member" also. I am OK if you believe it is better to keep them in the invalid state forever until we explicitly change them to "member" eventually. > > >> A valid parent partition may distribute out all its CPUs to >> its child partitions as long as it is not the root cgroup and >> there is no task associated with it. > A valid parent partition which isn't root never has tasks in them to begin > with. I believe there are some corner cases where it is possible to put a task in an intermediate partition. That is why I put down this statement. Cheers, Longman
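The hotplug round trip described above ("isolated" --> invalid --> "isolated") can be observed with something like the following sketch; the partition path and the assumption that cpu3 is its only CPU are hypothetical, and it needs root privileges.

PART_FILE = "/sys/fs/cgroup/rtpart/cpuset.cpus.partition"  # assumed partition, cpuset.cpus=3
CPU3_ONLINE = "/sys/devices/system/cpu/cpu3/online"

def read(path):
    with open(path) as f:
        return f.read().strip()

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

print(read(PART_FILE))        # "isolated"

# Offlining the partition's only CPU leaves it with no effective CPUs,
# so the partition turns invalid ...
write(CPU3_ONLINE, "0")
print(read(PART_FILE))        # an invalid state with a reason, e.g. "root invalid (...)"

# ... and onlining the CPU again lets it recover automatically, without
# another write to cpuset.cpus.partition.
write(CPU3_ONLINE, "1")
print(read(PART_FILE))        # back to "isolated"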
On Tue, Nov 30, 2021 at 10:56:34PM -0500, Waiman Long <longman@redhat.com> wrote: > > > A valid parent partition may distribute out all its CPUs to > > > its child partitions as long as it is not the root cgroup and > > > there is no task associated with it. > > A valid parent partition which isn't root never has tasks in them to begin > > with. > I believe there is some corner cases where it is possible to put task in an > intermediate partition. That is why I put down this statement. Just mind the threads -- cpuset controller is threaded and having tasks in inner cgroup nodes is a real scenario. I wouldn't consider it a corner case. [ Actually, the paragraph could IMO be simplified: > A valid parent partition may distribute out all its CPUs to > its child partitions as long as there is no task associated with it. Assuming there's always at least one kernel thread in the root cgroup that can't be migrated anyway.] Michal
On 11/30/21 22:56, Waiman Long wrote: > On 11/30/21 12:11, Tejun Heo wrote: > > >>> Once becoming a partition root, the following two rules restrict >>> what changes can be made to "cpuset.cpus". >>> >>> 1) The value must be exclusive. >>> 2) If child cpusets exist, the value must be a superset of what >>> are defined in the child cpusets. >>> >>> The second rule applies even for "member". Other changes to >>> "cpuset.cpus" that do not violate the above rules are always >>> allowed. >> While it isn't necessarily tied to this series, it's a big no-no to >> restrict >> what a parent can do depending on what its descendants are doing. A >> cgroup >> higher up in the hierarchy should be able to change configuration >> however it >> sees fit as deligation breaks down otherwise. >> >> Maybe you can argue that cpuset is special and shouldn't be subject >> to such >> convention but I can't see strong enough justifications especially given >> that most of these restrictions can be broken by hotplug operations >> anyway >> and thus need code to handle those situations. > > These are all pre-existing restrictions before the introduction of > partition. These are checks done in validate_change(). I am just > saying out loud the existing behavior. If you think that needs to be > changed, I am fine with that. However, it will be a separate patch as > it is not a behavior that is introduced by this series. Of the 2 restrictions listed above, the exclusivity rule is due to the use of CS_CPU_EXCLUSIVE flag. I think it is reasonable as it affects only siblings, not the parent. The second restriction was found during my testing. It is caused by the following code in validate_change(): /* Each of our child cpusets must be a subset of us */ ret = -EBUSY; cpuset_for_each_child(c, css, cur) if (!is_cpuset_subset(c, trial)) goto out; It seems that this code was there since v2.6.12 (the beginning of the git era). Later in function, we have /* On legacy hierarchy, we must be a subset of our parent cpuset. */ ret = -EACCES; if (!is_in_v2_mode() && !is_cpuset_subset(trial, par)) goto out; This is actually a duplicate in the case of legacy hierarchy. I can add a patch to take out the first code block above which I think is where most of your objections are. Then I can remove the 2nd restriction in my documentation. I would like to emphasize that this is a pre-existing behavior which I just happen to document. Cheers, Longman
On 12/1/21 09:13, Michal Koutný wrote: > On Tue, Nov 30, 2021 at 10:56:34PM -0500, Waiman Long <longman@redhat.com> wrote: >>>> A valid parent partition may distribute out all its CPUs to >>>> its child partitions as long as it is not the root cgroup and >>>> there is no task associated with it. >>> A valid parent partition which isn't root never has tasks in them to begin >>> with. >> I believe there are some corner cases where it is possible to put a task in an >> intermediate partition. That is why I put down this statement. > Just mind the threads -- cpuset controller is threaded and having tasks > in inner cgroup nodes is a real scenario. I wouldn't consider it a > corner case. > > [ Actually, the paragraph could IMO be simplified: Right, I shouldn't say corner cases. Having tasks in an intermediate partition is possible depending on the event sequence. I am aware that there is code in the cpuset code to prevent that, but it doesn't block all cases. >> A valid parent partition may distribute out all its CPUs to >> its child partitions as long as there is no task associated with it. > Assuming there's always at least one kernel thread in the root cgroup > that can't be migrated anyway.] I am aware of that. That is why I said the root cgroup must have at least one cpu in its "cpuset.cpus.effective". Cheers, Longman
On Wed, Dec 01, 2021 at 09:56:21AM -0500, Waiman Long wrote: > Right, I shouldn't say corner cases. Having task in an intermediate > partition is possible depending on event sequence. I am aware that there are > code in the cpuset code to prevent that, but it didn't block all cases. > > > A valid parent partition may distribute out all its CPUs to > > > its child partitions as long as there is no task associated with it. > > Assuming there's always at least one kernel thread in the root cgroup > > that can't be migrated anyway.] > > I am aware of that. That is why I said root cgroup must have at least one > cpu in its "cpuset.cpus.effective". In that case, let's explicitly describe that condition. Thanks.
Hello, Waiman. On Tue, Nov 30, 2021 at 10:56:34PM -0500, Waiman Long wrote: > > What happens if an isolated domain becomes invalid and then valid again due > > to cpu hotplug? Does it go "root invalid" and then back to "isolated"? > > Yes, the current code allow recovering from an invalid state. In this > particular case, the transition will be "isolated" --> "root invalid" --> > "isolated". Wouldn't it be clearer if it became "isolated invalid"? > > While it isn't necessarily tied to this series, it's a big no-no to restrict > > what a parent can do depending on what its descendants are doing. A cgroup > > higher up in the hierarchy should be able to change configuration however it > > sees fit as deligation breaks down otherwise. > > > > Maybe you can argue that cpuset is special and shouldn't be subject to such > > convention but I can't see strong enough justifications especially given > > that most of these restrictions can be broken by hotplug operations anyway > > and thus need code to handle those situations. > > These are all pre-existing restrictions before the introduction of > partition. These are checks done in validate_change(). I am just saying out > loud the existing behavior. If you think that needs to be changed, I am fine > with that. However, it will be a separate patch as it is not a behavior that > is introduced by this series. I see. It looks more problematic now with the addition of the state transition error reporting, more possible state transitions and, well, actual documentation. > Once an invalid partition is changed to "member", there is no way for a > child invalid partition root to recover and become valid again. That is why > I force them to become "member" also. I am OK if you believe it is better to > keep them in the invalid state forever until we explicitly change them to > "member" eventually. That's because we don't allow turning a cgroup with descendants into a partition, right? So, when we were first adding the partition support, the thinking was that as it's pretty niche anyway, we can take some aberrations and restrictions, but I don't think it's a good direction to be building up on top of those like this and would much prefer to clean up the rules and restrictions. I know that this has been going on for quite a while and am sorry that I am coming back to the same issue repeatedly which isn't necessarily caused by the proposed change. What do you think? Thanks.
On 12/1/21 11:39, Tejun Heo wrote: > On Wed, Dec 01, 2021 at 09:56:21AM -0500, Waiman Long wrote: >> Right, I shouldn't say corner cases. Having tasks in an intermediate >> partition is possible depending on the event sequence. I am aware that there is >> code in the cpuset code to prevent that, but it doesn't block all cases. >>>> A valid parent partition may distribute out all its CPUs to >>>> its child partitions as long as there is no task associated with it. >>> Assuming there's always at least one kernel thread in the root cgroup >>> that can't be migrated anyway.] >> I am aware of that. That is why I said the root cgroup must have at least one >> cpu in its "cpuset.cpus.effective". > In that case, let's explicitly describe that condition. Yes, I will. Only a non-root cgroup can distribute out all its CPUs. I thought I said that in the documentation, maybe it is not very clear. Cheers, Longman
On 12/1/21 11:46, Tejun Heo wrote: > Hello, Waiman. > > On Tue, Nov 30, 2021 at 10:56:34PM -0500, Waiman Long wrote: >>> What happens if an isolated domain becomes invalid and then valid again due >>> to cpu hotplug? Does it go "root invalid" and then back to "isolated"? >> Yes, the current code allow recovering from an invalid state. In this >> particular case, the transition will be "isolated" --> "root invalid" --> >> "isolated". > Wouldn't it be clearer if it became "isolated invalid"? You are right. I have overlooked that. Will make the change. > >>> While it isn't necessarily tied to this series, it's a big no-no to restrict >>> what a parent can do depending on what its descendants are doing. A cgroup >>> higher up in the hierarchy should be able to change configuration however it >>> sees fit as deligation breaks down otherwise. >>> >>> Maybe you can argue that cpuset is special and shouldn't be subject to such >>> convention but I can't see strong enough justifications especially given >>> that most of these restrictions can be broken by hotplug operations anyway >>> and thus need code to handle those situations. >> These are all pre-existing restrictions before the introduction of >> partition. These are checks done in validate_change(). I am just saying out >> loud the existing behavior. If you think that needs to be changed, I am fine >> with that. However, it will be a separate patch as it is not a behavior that >> is introduced by this series. > I see. It looks more problematic now with the addtion of the state > transition error reporting, more possible state transitions and, well, > actual documentation. I am going to add a patch to take out the child superset limitation for the default hierarchy as I believe it is probably an oversight that we were not aware of before. I would like to keep the exclusivity rule though as I think it makes sense. > >> Once an invalid partition is changed to "member", there is no way for a >> child invalid partition root to recover and become valid again. There is why >> I force them to become "member" also. I am OK if you believe it is better to >> keep them in the invalid state forever until we explicitly changed them to >> "member" eventually. > That's because we don't allow turning a cgroup with descendants into a > partition, right? Yes, that is a major part of it. > > So, when we were first adding the partition support, the thinking was that > as it's pretty niche anyway, we can take some aberrations and restrictions, > but I don't think it's a good direction to be building up on top of those > like this and would much prefer to clean up the rules and restrictions. I > know that this has been going on for quite a while and am sorry that am > coming back to the same issue repeatedly which isn't necessarily caused by > the proposed change. What do you think? I think I can relax some of the restrictions, but probably not all of them at this time. We can certainly working on removing as much restriction and limitations as possible in future update to the partition code. Cheers, Longman
On 12/1/21 13:05, Waiman Long wrote: > > On 12/1/21 11:46, Tejun Heo wrote: >> >> So, when we were first adding the partition support, the thinking was >> that >> as it's pretty niche anyway, we can take some aberrations and >> restrictions, >> but I don't think it's a good direction to be building up on top of >> those >> like this and would much prefer to clean up the rules and >> restrictions. I >> know that this has been going on for quite a while and am sorry that am >> coming back to the same issue repeatedly which isn't necessarily >> caused by >> the proposed change. What do you think? > > I think I can relax some of the restrictions, but probably not all of > them at this time. We can certainly working on removing as much > restriction and limitations as possible in future update to the > partition code. I would say that partition is a cpuset feature that only a minority of users may ever need to use. What I don't want to do is to make the partition feature as general and accommodating as possible and then some of them become dead code that people never use. It won't break binary compatibility to relax or remove limitations in the future. However, imposing new limitation or restriction in the future may not be possible. So I would like to see new use cases evolve that require us to remove the limitations. If that happens, I am happy to update the code to accommodate the new use cases. For the current use cases of partition that I am aware of, the current limitations as documented will not be a problem for those use cases. The document below is my latest draft of the document. There are several major changes from the earlier draft: 1) The limitation that "cpuset.cpus" has to be a superset of child's "cpuset.cpus" has been removed as a new patch to remove that limitation will be added. 2) The initial transition from "member" to partition root now requires that "cpuset.cpus" overlap with that of the parent's "cpuset.cpus" instead of being a superset. 3) Now read back of "cpuset.cpus.partition" may return "isolated invalid". For the transition back to "member", I haven't changed the current wording of forcing child partition roots to become "member" yet. If you think keeping them as invalid partition root is better, I can made that change too. Please let me know what other changes you would like to see. Cheers, Longman ---------------------------------------------------------------------------------------- cpuset.cpus.partition A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable. It accepts only the following input values when written to. ======== ================================ "member" Non-root member of a partition "root" Partition root "isolated" Partition root without load balancing ======== ================================ The root cgroup is always a partition root and its state cannot be changed. All other non-root cgroups start out as "member". When set to "root", the current cgroup is the root of a new partition or scheduling domain that comprises itself and all its descendants except those that are separate partition roots themselves and their descendants. The value shown in "cpuset.cpus.effective" of a partition root is the CPUs that the parent partition root can dedicate to the new partition root. 
They are subtracted from "cpuset.cpus.effective" of the parent and may be different from "cpuset.cpus" When set to "isolated", the CPUs in that partition root will be in an isolated state without any load balancing from the scheduler. Tasks placed in such a partition with multiple CPUs should be carefully distributed and bound to each of the individual CPUs for optimal performance. A partition root ("root" or "isolated") can be in one of the two possible states - valid or invalid. An invalid partition root is in a degraded state where some state information are retained, but behaves more like a "member". On read, the "cpuset.cpus.partition" file can show the following values. ====================== ============================== "member" Non-root member of a partition "root" Partition root "isolated" Partition root without load balancing "root invalid (<reason>)" Invalid partition root "isolated invalid (<reason>)" Invalid isolated partition root ====================== ============================== In the case of an invalid partition root, a descriptive string on why the partition is invalid is included within parentheses. Almost all possible state transitions among "member", valid and invalid partition roots are allowed except from "member" to invalid partition root. Before the "member" to partition root transition can happen, the following conditions must be met or the transition will not be allowed. 1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are not shared by any of its siblings. 2) The parent cgroup is a valid partition root. 3) The "cpuset.cpus" must contain at least one of the CPUs from parent's "cpuset.cpus", i.e. they overlap. 4) There is no child cgroups with cpuset enabled. This avoids cpu migrations of multiple cgroups simultaneously which can be problematic. Once becoming a partition root, the only rule restricting changes made to "cpuset.cpus" is the exclusivity rule where none of the siblings of a partition root can share CPUs with it. External events like hotplug or inappropriate changes to "cpuset.cpus" can cause a valid partition root to become invalid. Besides the exclusivity rule listed above, the other conditions required to maintain the validity of a partition root are as follows: 1) The parent cgroup is a valid partition root. 2) If "cpuset.cpus.effective" is empty, the partition must have no task associated with it. Otherwise, the partition becomes invalid and "cpuset.cpus.effective" will fall back to that of the nearest non-empty ancestor. A corollary of a valid partition root is that "cpuset.cpus.effective" is always a subset of "cpuset.cpus". Note that a task cannot be moved to a cgroup with empty "cpuset.cpus.effective". Changing a partition root (valid or invalid) to "member" is always allowed. If there are child partition roots underneath it, however, they will be forced to be switched back to "member" too and lose their partitions. So care must be taken to double check for this condition before disabling a partition root. A valid non-root parent partition may distribute out all its CPUs to its child partitions when there is no task associated with it. An invalid partition root can be reverted back to a valid one if none of the validity constraints of a valid partition root are violated. Poll and inotify events are triggered whenever the state of "cpuset.cpus.partition" changes. That includes changes caused by write to "cpuset.cpus.partition", cpu hotplug and other changes that make the partition invalid. 
This will allow user space agents to monitor unexpected changes to "cpuset.cpus.partition" without the need to do continuous polling.
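For the notification paragraph at the end, a user space agent could consume the events along these lines (a sketch; the cgroup path is an assumption, and the same change is also delivered as an inotify modify event on the file):

import select

PART_FILE = "/sys/fs/cgroup/rtpart/cpuset.cpus.partition"   # assumed partition cgroup

with open(PART_FILE) as f:
    poller = select.poll()
    # cgroup interface files signal state changes via POLLPRI (with POLLERR),
    # so the agent can block here instead of re-reading the file periodically.
    poller.register(f.fileno(), select.POLLPRI | select.POLLERR)
    while True:
        poller.poll()                    # wakes up on every partition state change
        f.seek(0)
        print("cpuset.cpus.partition:", f.read().strip())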
Hello Longman. On Wed, Dec 01, 2021 at 08:28:09PM -0500, Waiman Long <longman@redhat.com> wrote: > 1) The limitation that "cpuset.cpus" has to be a superset of child's > "cpuset.cpus" has been removed as a new patch to remove that limitation will > be added. Superb! > 2) The initial transition from "member" to partition root now requires that > "cpuset.cpus" overlap with that of the parent's "cpuset.cpus" instead of > being a superset. That's sensible. > For the transition back to "member", I haven't changed the current wording > of forcing child partition roots to become "member" yet. If you think > keeping them as invalid partition root is better, I can made that change > too. I wrote I was indifferent about this in a previous mail but when I think about it now, switching to invalid root is perhaps better than switching to member since it'd effectively mean that modifications of the parent config propagate (permanently) also to a descendant config, which is an undesired v1-ism. > Please let me know what other changes you would like to see. I hope my remarks below are just clarifications and not substantial changes. Besides that I find your new draft good. Thanks! > [...] > An invalid partition root can be reverted back to a valid one > if none of the validity constraints of a valid partition root > are violated. s/can be/will be/ (I understand the intention is to make it asynchronously and automatically, i.e. without writing into the affected descendant(s) cpuset.partition again.) > Poll and inotify events are triggered whenever the state of > "cpuset.cpus.partition" changes. That includes changes caused by > write to "cpuset.cpus.partition", cpu hotplug and other changes > that make the partition invalid. -> that change validity status (In accordance with the comment above.) Michal
On 12/3/21 13:25, Michal Koutný wrote: > Hello Longman. > > On Wed, Dec 01, 2021 at 08:28:09PM -0500, Waiman Long <longman@redhat.com> wrote: >> 1) The limitation that "cpuset.cpus" has to be a superset of child's >> "cpuset.cpus" has been removed as a new patch to remove that limitation will >> be added. > Superb! > >> 2) The initial transition from "member" to partition root now requires that >> "cpuset.cpus" overlap with that of the parent's "cpuset.cpus" instead of >> being a superset. > That's sensible. > >> For the transition back to "member", I haven't changed the current wording >> of forcing child partition roots to become "member" yet. If you think >> keeping them as invalid partition root is better, I can made that change >> too. > I wrote I was indifferent about this in a previous mail but when I think > about it now, switching to invalid root is perhaps better than switching > to member since it'd effectively mean that modifications of the parent > config propagate (permanently) also to a descendant config, which is > an undesired v1-ism. That makes sense. I will keep those child partitions in an invalid state then. > >> Please let me know what other changes you would like to see. > I hope my remarks below are just clarifications and not substantial > changes. Besides that I find your new draft good. Thanks! > >> [...] >> An invalid partition root can be reverted back to a valid one >> if none of the validity constraints of a valid partition root >> are violated. > s/can be/will be/ > > (I understand the intention is to make it asynchronously and > automatically, i.e. without writing into the affected descendant(s) > cpuset.partition again.) Yes, that will be automatic and the partition will become valid again if other events cause changes that unbreak the validity constraints. > >> Poll and inotify events are triggered whenever the state of >> "cpuset.cpus.partition" changes. That includes changes caused by >> write to "cpuset.cpus.partition", cpu hotplug and other changes >> that make the partition invalid. > -> that change validity status > > (In accordance with the comment above.) Cheers, Longman
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 4d8c27eca96b..40d39562a8dd 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2091,8 +2091,9 @@ Cpuset Interface Files
         It accepts only the following input values when written to.
 
           ========        ================================
-          "root"          a partition root
-          "member"        a non-root member of a partition
+          "member"        Non-root member of a partition
+          "root"          Partition root
+          "isolated"      Partition root without load balancing
           ========        ================================
 
         When set to be a partition root, the current cgroup is the
@@ -2101,64 +2102,96 @@ Cpuset Interface Files
         partition roots themselves and their descendants. The root
         cgroup is always a partition root.
 
-        There are constraints on where a partition root can be set.
-        It can only be set in a cgroup if all the following conditions
-        are true.
-
-        1) The "cpuset.cpus" is not empty and the list of CPUs are
-           exclusive, i.e. they are not shared by any of its siblings.
-        2) The parent cgroup is a partition root.
-        3) The "cpuset.cpus" is also a proper subset of the parent's
-           "cpuset.cpus.effective".
-        4) There is no child cgroups with cpuset enabled. This is for
-           eliminating corner cases that have to be handled if such a
-           condition is allowed.
-
-        Setting it to partition root will take the CPUs away from the
-        effective CPUs of the parent cgroup. Once it is set, this
-        file cannot be reverted back to "member" if there are any child
-        cgroups with cpuset enabled.
-
-        A parent partition cannot distribute all its CPUs to its
-        child partitions. There must be at least one cpu left in the
-        parent partition.
-
-        Once becoming a partition root, changes to "cpuset.cpus" is
-        generally allowed as long as the first condition above is true,
-        the change will not take away all the CPUs from the parent
-        partition and the new "cpuset.cpus" value is a superset of its
-        children's "cpuset.cpus" values.
-
-        Sometimes, external factors like changes to ancestors'
-        "cpuset.cpus" or cpu hotplug can cause the state of the partition
-        root to change. On read, the "cpuset.sched.partition" file
-        can show the following values.
-
-          ==============        ==============================
-          "member"              Non-root member of a partition
-          "root"                Partition root
-          "root invalid"        Invalid partition root
-          ==============        ==============================
-
-        It is a partition root if the first 2 partition root conditions
-        above are true and at least one CPU from "cpuset.cpus" is
-        granted by the parent cgroup.
-
-        A partition root can become invalid if none of CPUs requested
-        in "cpuset.cpus" can be granted by the parent cgroup or the
-        parent cgroup is no longer a partition root itself. In this
-        case, it is not a real partition even though the restriction
-        of the first partition root condition above will still apply.
-        The cpu affinity of all the tasks in the cgroup will then be
-        associated with CPUs in the nearest ancestor partition.
-
-        An invalid partition root can be transitioned back to a
-        real partition root if at least one of the requested CPUs
-        can now be granted by its parent. In this case, the cpu
-        affinity of all the tasks in the formerly invalid partition
-        will be associated to the CPUs of the newly formed partition.
-        Changing the partition state of an invalid partition root to
-        "member" is always allowed even if child cpusets are present.
+ When set to "isolated", the CPUs in that partition root will + be in an isolated state without any load balancing from the + scheduler. Tasks in such a partition must be explicitly bound + to each individual CPU. + + "cpuset.cpus" must always be set up first before enabling + partition. Unlike "member" whose "cpuset.cpus.effective" can + contain CPUs not in "cpuset.cpus", this can never happen with a + valid partition root. In other words, "cpuset.cpus.effective" + is always a subset of "cpuset.cpus" for a valid partition root. + + When a parent partition root cannot exclusively grant any of + the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective" + becomes empty. If there are tasks in the partition root, the + partition root becomes invalid and "cpuset.cpus.effective" + is reset to that of the nearest non-empty ancestor. + + Note that a task cannot be moved to a cgroup with empty + "cpuset.cpus.effective". + + There are additional constraints on where a partition root can + be enabled ("root" or "isolated"). It can only be enabled in + a cgroup if all the following conditions are met. + + 1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are + not shared by any of its siblings. + 2) The parent cgroup is a valid partition root. + 3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus". + 4) There is no child cgroups with cpuset enabled. This avoids + cpu migrations of multiple cgroups simultaneously which can + be problematic. + + On read, the "cpuset.cpus.partition" file can show the following + values. + + ====================== ============================== + "member" Non-root member of a partition + "root" Partition root + "isolated" Partition root without load balancing + "root invalid (<reason>)" Invalid partition root + ====================== ============================== + + In the case of an invalid partition root, a descriptive string on + why the partition is invalid is included within parentheses. + + Once becoming a partition root, changes to "cpuset.cpus" + is generally allowed as long as the cpu list is exclusive, + non-empty and is a superset of children's cpu lists. + + The constraints of a valid partition root are as follows: + + 1) The parent cgroup is a valid partition root. + 2) "cpuset.cpus.effective" is a subset of "cpuset.cpus" + 3) "cpuset.cpus.effective" is non-empty when there are tasks + in the partition. + + Changes to "cpuset.cpus" or cpu hotplug may cause the state + of a valid partition root to become invalid when one or more + constraints of a valid partition root are violated. Therefore, + user space agents that manage partition roots should avoid + unnecessary changes to "cpuset.cpus" and always check the state + of "cpuset.cpus.partition" after making changes to make sure + that the partitions are functioning properly as expected. + + Changing a partition root to "member" is always allowed. + If there are child partition roots underneath it, however, + they will be forced to be switched back to "member" too and + lose their partitions. So care must be taken to double check + for this condition before disabling a partition root. + + Setting a cgroup to a valid partition root will take the CPUs + away from the effective CPUs of the parent partition. + + A valid parent partition may distribute out all its CPUs to + its child partitions as long as it is not the root cgroup as + we need some house-keeping CPUs in the root cgroup. + + An invalid partition is not a real partition even though some + internal states may still be kept. 
+
+        An invalid partition root can be reverted back to a real
+        partition root if none of the constraints of a valid partition
+        root are violated.
+
+        Poll and inotify events are triggered whenever the state of
+        "cpuset.cpus.partition" changes. That includes changes caused by
+        write to "cpuset.cpus.partition", cpu hotplug and other changes
+        that make the partition invalid. This will allow user space
+        agents to monitor unexpected changes to "cpuset.cpus.partition"
+        without the need to do continuous polling.
 
 Device controller
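[Editorial note, not part of the patch] The setup order described in the
new text above can be illustrated with the following sketch. The cgroup
name "rt-part", the CPU list "2-3", and an already-enabled cpuset
controller in the parent's "cgroup.subtree_control" are all assumptions:
"cpuset.cpus" is written first, then the partition type, and finally
"cpuset.cpus.partition" is read back because an invalid root reports its
reason there.

/*
 * Hypothetical example: create an isolated partition under a cgroup2
 * mount at /sys/fs/cgroup. Assumes CPUs 2-3 exist and are not claimed
 * by a sibling partition.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		if (fd >= 0)
			close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int main(void)
{
	char buf[64];
	ssize_t len;
	int fd;

	mkdir("/sys/fs/cgroup/rt-part", 0755);

	/* 1) "cpuset.cpus" must be populated before enabling the partition. */
	if (write_str("/sys/fs/cgroup/rt-part/cpuset.cpus", "2-3"))
		return 1;

	/* 2) Request an isolated partition root (no load balancing). */
	if (write_str("/sys/fs/cgroup/rt-part/cpuset.cpus.partition", "isolated"))
		return 1;

	/* 3) Read the state back; an invalid root includes a reason string. */
	fd = open("/sys/fs/cgroup/rt-part/cpuset.cpus.partition", O_RDONLY);
	if (fd < 0)
		return 1;
	len = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (len > 0) {
		buf[len] = '\0';
		printf("cpuset.cpus.partition: %s", buf);
	}
	return 0;
}

Step 3 is the check the new documentation text recommends for management
agents, since the root may also turn invalid later, e.g. after cpu
hotplug.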
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as other changes made in other
cpuset patches.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 153 ++++++++++++++----------
 1 file changed, 93 insertions(+), 60 deletions(-)