Message ID | 20210825213750.6933-6-longman@redhat.com (mailing list archive) |
---|---|
State | New |
Series | cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus |
Hello, Waiman.

Let's stop iterating on the patchset until we reach a consensus.

On Wed, Aug 25, 2021 at 05:37:49PM -0400, Waiman Long wrote:
> 1) The "cpuset.cpus" is not empty and the list of CPUs are
>    exclusive, i.e. they are not shared by any of its siblings.

Part of it can be reached by cpus going offline.

> 2) The parent cgroup is a partition root.

This condition can happen if a parent stops being a partition.

> - 3) The "cpuset.cpus" is also a proper subset of the parent's
> + 3) The "cpuset.cpus" is a subset of the parent's
>    "cpuset.cpus.effective".

This can happen if cpus go offline.

> 4) There is no child cgroups with cpuset enabled.  This is for
>    eliminating corner cases that have to be handled if such a
>    condition is allowed.

This may make sense as a short cut for us but doesn't really stem from interface or behavior requirements.

Of the four conditions listed, two are bogus (the states can be reached through a different path and the configuration success or failure can be timing dependent if configuration races against cpu hotplug operations) and one maybe makes sense half-way and one is more of a shortcut.

Can't we just replace these with transitions to invalid state with proper explanation? That'd get rid of the error handling duplications from both the kernel and user side, make automated configurations which may race against hot plug operations reliable, and consistently provide users with why something failed.

Thank you.
On 8/26/21 1:35 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> Let's stop iterating on the patchset until we reach a consensus.
>
> On Wed, Aug 25, 2021 at 05:37:49PM -0400, Waiman Long wrote:
>> 1) The "cpuset.cpus" is not empty and the list of CPUs are
>>    exclusive, i.e. they are not shared by any of its siblings.
> Part of it can be reached by cpus going offline.
>
>> 2) The parent cgroup is a partition root.
> This condition can happen if a parent stops being a partition.
>
>> - 3) The "cpuset.cpus" is also a proper subset of the parent's
>> + 3) The "cpuset.cpus" is a subset of the parent's
>>    "cpuset.cpus.effective".
> This can happen if cpus go offline.
>
>> 4) There is no child cgroups with cpuset enabled.  This is for
>>    eliminating corner cases that have to be handled if such a
>>    condition is allowed.
> This may make sense as a short cut for us but doesn't really stem from
> interface or behavior requirements.
>
> Of the four conditions listed, two are bogus (the states can be
> reached through a different path and the configuration success or
> failure can be timing dependent if configuration races against cpu
> hotplug operations) and one maybe makes sense half-way and one is more
> of a shortcut.
>
> Can't we just replace these with transitions to invalid state with
> proper explanation? That'd get rid of the error handling duplications
> from both the kernel and user side, make automated configurations
> which may race against hot plug operations reliable, and consistently
> provide users with why something failed.

What I am doing here is setting a high bar for transitioning from member to either "root" or "isolated". Once it becomes a partition, there are multiple ways that can make it invalid. I am fine with that. However, I am not sure it is a good idea to allow users to echo "root" to cpuset.cpus.partition anywhere in the cgroup hierarchy and require them to read it back to see if it succeeded.

All the checks are done with cpuset_rwsem held. So there shouldn't be any racing. Of course, a hotplug can immediately follow and make the partition invalid.

Cheers,
Longman
Hello,

On Thu, Aug 26, 2021 at 11:01:30PM -0400, Waiman Long wrote:
> What I am doing here is setting a high bar for transitioning from member to
> either "root" or "isolated". Once it becomes a partition, there are multiple
> ways that can make it invalid. I am fine with that. However, I am not sure
> it is a good idea to allow users to echo "root" to cpuset.cpus.partition
> anywhere in the cgroup hierarchy and require them to read it back to see if
> it succeeded.

The problem is that the "high" bar is rather arbitrary. It might feel like a good idea to some but not to others. There are no clear technical reasons or principles for rules to be set this particular way.

> All the checks are done with cpuset_rwsem held. So there shouldn't be any
> racing. Of course, a hotplug can immediately follow and make the partition
> invalid.

Imagine a system which dynamically on/offlines its cpus based on load or whatever and also configures partitions for cases where the needed cpus are online. If the partitions are set up while the cpus are online, it'd work as expected - partitions are in effect when the system can support them and ignored otherwise. However, if the partition configuration is attempted while the cpus happen to be offline, the configuration will fail, and there is no guaranteed way to make that configuration stick short of disabling hotplug operations. This is a pretty jarring breakage happening exactly because the behavior is an inconsistent amalgam.

It's usually not a good sign if interface restrictions can be added or removed because of how one feels without clear functional reasons and often indicates that there's something broken, which seems to be the case here too.

Thanks.
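The timing dependence described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: the function names `strict_enable` and `deferred_state` are hypothetical, CPU sets are plain Python sets, and the "deferred" rule (valid when at least one requested CPU is available) is the reading Waiman gives later in the thread.

```python
def strict_enable(requested, parent_effective):
    """Strict model (hypothetical): accept the write only if the request is
    non-empty and every requested CPU is effective in the parent right now."""
    return bool(requested) and requested <= parent_effective


def deferred_state(requested, parent_effective):
    """Deferred model (hypothetical): always accept the write and derive the
    reported state from the CPUs that are available at this moment."""
    return "root" if requested & parent_effective else "root invalid"


requested = {2, 3}
parent_when_online = {0, 1, 2, 3}
parent_when_offline = {0, 1}   # CPUs 2-3 happen to be offline

# Strict model: the same write succeeds or fails depending on timing.
print(strict_enable(requested, parent_when_online))    # True
print(strict_enable(requested, parent_when_offline))   # False

# Deferred model: the configuration sticks and validity follows hotplug.
print(deferred_state(requested, parent_when_offline))  # root invalid
print(deferred_state(requested, parent_when_online))   # root
```

In the strict model an automated agent has to retry until the CPUs happen to be online; in the deferred model it writes once and watches the state.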
On 8/27/21 12:00 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 26, 2021 at 11:01:30PM -0400, Waiman Long wrote:
>> What I am doing here is setting a high bar for transitioning from member to
>> either "root" or "isolated". Once it becomes a partition, there are multiple
>> ways that can make it invalid. I am fine with that. However, I am not sure
>> it is a good idea to allow users to echo "root" to cpuset.cpus.partition
>> anywhere in the cgroup hierarchy and require them to read it back to see if
>> it succeeded.
> The problem is that the "high" bar is rather arbitrary. It might feel like a
> good idea to some but not to others. There are no clear technical reasons or
> principles for rules to be set this particular way.
>
>> All the checks are done with cpuset_rwsem held. So there shouldn't be any
>> racing. Of course, a hotplug can immediately follow and make the partition
>> invalid.
> Imagine a system which dynamically on/offlines its cpus based on load or
> whatever and also configures partitions for cases where the needed cpus are
> online. If the partitions are set up while the cpus are online, it'd work as
> expected - partitions are in effect when the system can support them and
> ignored otherwise. However, if the partition configuration is attempted
> while the cpus happen to be offline, the configuration will fail, and there
> is no guaranteed way to make that configuration stick short of disabling
> hotplug operations. This is a pretty jarring breakage happening exactly
> because the behavior is an inconsistent amalgam.
>
> It's usually not a good sign if interface restrictions can be added or
> removed because of how one feels without clear functional reasons and often
> indicates that there's something broken, which seems to be the case here
> too.

Well, that is a valid point. The cpus may have been offlined when a partition is being created. I can certainly relent on this check in forming a partition. IOW, cpus_allowed can contain some or all offline cpus and a valid (some are online) or invalid (all are offline) partition can be formed. I can also allow an invalid child partition to be formed with an invalid parent partition. However, the cpu exclusivity rules will still apply.

Other than that, do you envision any other circumstances where we should allow an invalid partition to be formed?

Cheers,
Longman
Hello,

On Fri, Aug 27, 2021 at 05:19:31PM -0400, Waiman Long wrote:
> Well, that is a valid point. The cpus may have been offlined when a
> partition is being created. I can certainly relent on this check in forming
> a partition. IOW, cpus_allowed can contain some or all offline cpus and a
> valid (some are online) or invalid (all are offline) partition can be
> formed. I can also allow an invalid child partition to be formed with an
> invalid parent partition. However, the cpu exclusivity rules will still
> apply.
>
> Other than that, do you envision any other circumstances where we should
> allow an invalid partition to be formed?

Now that most restrictions are removed from configuration side, just go all the way? Given that the user must check the status afterwards anyway, I don't see technical or even usability reasons for leaving some pieces behind. Going all the way would be easier to use too - bang in the target config and read the resulting state to reliably find out why a partition isn't valid, especially if we list *all* the reasons so that the user can tell whether the configuration is as intended immediately.

Thanks.
On 8/27/21 5:27 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Aug 27, 2021 at 05:19:31PM -0400, Waiman Long wrote:
>> Well, that is a valid point. The cpus may have been offlined when a
>> partition is being created. I can certainly relent on this check in forming
>> a partition. IOW, cpus_allowed can contain some or all offline cpus and a
>> valid (some are online) or invalid (all are offline) partition can be
>> formed. I can also allow an invalid child partition to be formed with an
>> invalid parent partition. However, the cpu exclusivity rules will still
>> apply.
>>
>> Other than that, do you envision any other circumstances where we should
>> allow an invalid partition to be formed?
> Now that most restrictions are removed from configuration side, just go all
> the way? Given that the user must check the status afterwards anyway, I
> don't see technical or even usability reasons for leaving some pieces
> behind. Going all the way would be easier to use too - bang in the target
> config and read the resulting state to reliably find out why a partition
> isn't valid, especially if we list *all* the reasons so that the user
> can tell whether the configuration is as intended immediately.

The cpu exclusivity rule is due to the setting of the CPU_EXCLUSIVE bit. This is a pre-existing condition unless you want to change how cpuset.cpu_exclusive works.

So the new rules will be:

1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.
2) The parent cgroup is a partition root (can be an invalid one).
3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.
4) No child cgroup with cpuset enabled.

I think they are reasonable. What do you think?

Cheers,
Longman
Hello,

On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long wrote:
> The cpu exclusivity rule is due to the setting of the CPU_EXCLUSIVE bit. This is
> a pre-existing condition unless you want to change how
> cpuset.cpu_exclusive works.
>
> So the new rules will be:
>
> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.

Empty cpu list can be considered an exclusive one.

> 2) The parent cgroup is a partition root (can be an invalid one).

Does this mean a partition parent can't stop being a partition if one or more of its children become partitions? If so, it violates the rule that a descendant shouldn't be able to restrict what its ancestors can do.

> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.

Why not just go by effective? This would mean that a parent can't withdraw CPUs from its allowed set once descendants are configured. Restrictions like this are fine when the entire hierarchy is configured by a single entity but become awkward when configurations are multi-tiered, automated and dynamic.

> 4) No child cgroup with cpuset enabled.

idk, maybe? I'm having a hard time seeing the point in adding these restrictions when the state transitions are asynchronous anyway. Would it help if we try to separate what's absolutely and technically necessary and what seems reasonable or high bar and try to justify why each of the latter should be added?

Thanks.
On 8/27/21 7:35 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long wrote:
>> The cpu exclusivity rule is due to the setting of the CPU_EXCLUSIVE bit. This is
>> a pre-existing condition unless you want to change how
>> cpuset.cpu_exclusive works.
>>
>> So the new rules will be:
>>
>> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.
> Empty cpu list can be considered an exclusive one.

It doesn't make sense to me to have a partition with no cpu configured at all. I very much prefer the users to set cpuset.cpus first before turning it into a partition.

>> 2) The parent cgroup is a partition root (can be an invalid one).
> Does this mean a partition parent can't stop being a partition if one or
> more of its children become partitions? If so, it violates the rule that a
> descendant shouldn't be able to restrict what its ancestors can do.

No. As I said in the documentation, transitioning from partition root to member is allowed. Again, it is illogical to allow a cpuset to become a potential partition if its parent is not even a partition root at all. In the case that the parent is reverted back to a member, the child partitions will stay invalid forever unless the parent becomes a valid partition again.

>> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.
> Why not just go by effective? This would mean that a parent can't withdraw
> CPUs from its allowed set once descendants are configured. Restrictions like
> this are fine when the entire hierarchy is configured by a single entity but
> become awkward when configurations are multi-tiered, automated and dynamic.

The original rule is to be based on effective cpus. However, to properly handle the case of allowing offlined cpus to be included in the partition, I have to change it to cpus_allowed instead. I can certainly change it back to effective if you prefer.

>> 4) No child cgroup with cpuset enabled.
> idk, maybe? I'm having a hard time seeing the point in adding these
> restrictions when the state transitions are asynchronous anyway. Would it
> help if we try to separate what's absolutely and technically necessary and
> what seems reasonable or high bar and try to justify why each of the latter
> should be added?

This rule is there mainly for ease of implementation. Otherwise, I need to add additional code to handle the conversion of child cpusets which can be rather complex and require a lot more debugging. This rule will no longer apply once the cpuset becomes a partition root.

Cheers,
Longman
Hello.

On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long <llong@redhat.com> wrote:
> So the new rules will be:

When I followed the thread, it seemed to me you're talking past each other a bit. I'd suggest the following terminology:

- config space: what's written by the user and saved,
- reality space: what's currently available (primarily subject to on-/offlining but I think it'd be helpful to consider here also what's given by the parent),
- effect space: what's actually possible and happening.

Not all elements of config_space x reality_space (Cartesian product) can be represented in the effect_space (e.g. root partition with no (effective) cpus).

IIUC, Waiman's "high bar" is supposed to be defined over transitions in the config_space. However, there can be independent changes in the reality_space so the rules should be actually formulated in the effect_space:

The conditions for being a valid partition root rewritten into the effect space:

> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.

- effective CPUs are non-empty and exclusive wrt siblings
- (E.g. setting empty cpuset.cpus might be possible but it invalidates the partition root, same as offlining or removal by an ancestor.)

> 2) The parent cgroup is a partition root (can be an invalid one).

- parent cgroup is a (valid) partition
- (Being valid partition means owning "stolen" cpus from the parent, if the parent is not valid partition itself, you can't steal what is not owned.)
- (And I think it's OK that: "the child partitions will stay invalid forever unless the parent becomes a valid partition again" [1].)

> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.

- I'm not sure what is the use of this condition (together with the rewrite of the 1st condition which covers effective cpus). I think it would make sense if being a valid partition root guaranteed that all configured cpuset.cpus will be available, however, that's not the case IIUC (e.g. due to offlining).

> 4) No child cgroup with cpuset enabled.

- A child cgroup with cpuset enabled is OK in the effect space (achievable by switching first and creating children later).
- For technical reasons this may be a condition on the transitions in the config_space.

Generally, most config changes should succeed and user should check (or watch) how they landed in combination with the reality_space.

Regards,
Michal

[1] This follows the general model where ancestors can "preempt" resources from their subtree.
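The config/reality/effect split proposed above can be written down as a tiny model. This is a sketch under stated assumptions, not the kernel's logic: the function names are hypothetical, "reality" is reduced to one set of currently available CPUs, and only the effect-space rewrites of conditions 1 and 2 are encoded.

```python
def effective_cpus(config_cpus, available):
    """Effect space: the configured CPUs that reality can provide now."""
    return frozenset(config_cpus) & frozenset(available)


def is_valid_partition_root(config_cpus, available, sibling_effective,
                            parent_is_valid_root):
    """Validity evaluated in the effect space (hypothetical helper).

    Rewritten condition 1: effective CPUs are non-empty and exclusive
    with respect to siblings.
    Rewritten condition 2: the parent is itself a valid partition root.
    """
    eff = effective_cpus(config_cpus, available)
    exclusive = not (eff & frozenset(sibling_effective))
    return bool(eff) and exclusive and parent_is_valid_root


config = {0, 1}        # config space: written once by the user
siblings = {2, 3}

# The same config lands differently as the reality space changes.
print(is_valid_partition_root(config, {0, 1, 2, 3}, siblings, True))   # True
print(is_valid_partition_root(config, {2, 3}, siblings, True))         # False
print(is_valid_partition_root(config, {0, 1, 2, 3}, siblings, False))  # False
```

The config value never changes between the three calls; only reality does, which is why the result is something to read back or watch rather than a precondition of the write.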
On Wed, Oct 06, 2021 at 02:21:03PM -0400, Waiman Long <llong@redhat.com> wrote:
> Sorry for not following up with this patchset sooner as I was busy on other
> tasks.

Thanks for continuing with this.

> 1) The "cpuset.cpus" is not empty and the list of CPUs are
>    exclusive, i.e. they are not shared by any of its siblings.
> 2) The parent cgroup is a partition root.
> 3) The "cpuset.cpus" is a subset of the union of parent's
>    "cpuset.cpus.effective" and offlined CPUs in parent's
>    "cpuset.cpus".
> 4) There is no child cgroups with cpuset enabled.  This avoids
>    cpu migrations of multiple cgroups simultaneously which can
>    be problematic.
>
> A partition, when enabled, can be in an invalid state. An example
> is when its parent is also an invalid partition.

You say:
"it can only be enabled in a cgroup if all the following conditions are met.",
"2) The parent cgroup is a partition root."

and then the example:
"A partition, when enabled, can be in an invalid state. An example is when its parent is also an invalid partition."

But the first two statements imply you can't have enabled the partition in such a case.

I think there is still a mixup of partition validity conditions and transition conditions, yours would roughly divide into (not precisely, just to share my understanding):

Validity conditions:
1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive, i.e. they are not shared by any of its siblings.
2) The parent cgroup is a partition root.

Transition conditions:
3) The "cpuset.cpus" is a subset of the union of parent's "cpuset.cpus.effective" and offlined CPUs in parent's "cpuset.cpus".
4) There is no child cgroups with cpuset enabled. This avoids cpu migrations of multiple cgroups simultaneously which can be problematic.

(I've put no. 3 into transition conditions because _after_ the transition parent's cpuset.cpus.effective are subtracted the new root's cpuset.cpus but I'd like to have something similar as a validity condition but I haven't come up with that yet.)

I consider the following situation:

r                       // all cpus 0-7
`- part1      cpus=0-3  root >partition
   ` subpart1 cpus=0-1  root >partition
   ` subpart2 cpus=2-3  root >partition
`- other      cpus=4-7  // member by default

Both subpart1 and subpart2 are valid partition roots.
Look at actions listed below (as alternatives, not a sequence):

a) hotplug offlines cpu 3
   - would part1 still be considered a valid root?
     - perhaps not
   - would subpart1 still be considered a valid root?
     - it could be, but its parent is invalid so no?
   - would subpart2 still be considered a valid root?
     - perhaps not

b) administrative change writes 0-2 into part1 cpus
   - would part1 still be considered a valid root?
     - yes
   - would subpart1 still be considered a valid root?
     - yes
   - would subpart2 still be considered a valid root?
     - perhaps not

c) administrative change writes 3-7 into `other` cpus
   - should this fail or invalidate a root partition part1?
     - perhaps fail since the same "owner" manages all siblings and should reduce part1 first

The answers above are just my "natural" responses, the ideal may be different. The issue I want to illustrate is that if all the conditions are formed as transition conditions only, they can't be used to reason about hotplug or config changes (except for cpuset.cpus.partition writes).

What would help me with the understanding -- the invalid root partition is defined as
1) such a cgroup where no cpus are granted from the top (and thus has to fall back to ancestors)
or
2) such a cgroup where cpus requested in cpuset.cpus can't be fulfilled (i.e. any missing invalidates)?

Furthermore, another example (motivated by the patch 4/6)

r                       // all cpus 0-7
`- part1      cpus=0-4  root >partition
   ` subpart1 cpus=0-1  root >partition
   ` subpart2 cpus=2-3  root >partition
   ` task
`- other      cpus=5-7  // member by default

It's a valid and achievable state (even on v2 since cpuset is a threaded controller).

a) cpu 4 is offlined
   - this should invalidate part1 (and propagate invalidation into subpart1 and subpart2).
b) administrative write 0-3 into part1 cpus
   - should this invalidate part1 or be rejected?

In conclusion, it'd be good to have validity conditions separate from transition conditions (since hotplug transition can't be rejected) and perhaps treat administrative changes from an ancestor equally as a hotplug.

Thanks,
Michal
On 10/12/21 10:39 AM, Michal Koutný wrote:
> On Wed, Oct 06, 2021 at 02:21:03PM -0400, Waiman Long <llong@redhat.com> wrote:
>> Sorry for not following up with this patchset sooner as I was busy on other
>> tasks.
> Thanks for continuing with this.
>
>> 1) The "cpuset.cpus" is not empty and the list of CPUs are
>>    exclusive, i.e. they are not shared by any of its siblings.
>> 2) The parent cgroup is a partition root.
>> 3) The "cpuset.cpus" is a subset of the union of parent's
>>    "cpuset.cpus.effective" and offlined CPUs in parent's
>>    "cpuset.cpus".
>> 4) There is no child cgroups with cpuset enabled.  This avoids
>>    cpu migrations of multiple cgroups simultaneously which can
>>    be problematic.
>>
>> A partition, when enabled, can be in an invalid state. An example

Thanks for the comments.

>> is when its parent is also an invalid partition.
> You say:
> "it can only be enabled in a cgroup if all the following conditions are met.",
> "2) The parent cgroup is a partition root."
>
> and then the example:
> "A partition, when enabled, can be in an invalid state. An example is
> when its parent is also an invalid partition."
>
> But the first two statements imply you can't have enabled the partition
> in such a case.

Yes, you are right. We should not allow enabling a partition when the parent is an invalid root. I will fix that.

> I think there is still a mixup of partition validity conditions and
> transition conditions, yours would roughly divide into (not precisely,
> just to share my understanding):
>
> Validity conditions:
> 1) The "cpuset.cpus" is not empty and the list of CPUs are
>    exclusive, i.e. they are not shared by any of its siblings.
> 2) The parent cgroup is a partition root.
>
> Transition conditions:
> 3) The "cpuset.cpus" is a subset of the union of parent's
>    "cpuset.cpus.effective" and offlined CPUs in parent's
>    "cpuset.cpus".

I am going to change this condition to just: "cpuset.cpus" is a subset of parent's "cpuset.cpus". After some deliberation, I concluded that it doesn't make sense from the system partition planning point of view to allow a valid partition to contain cpus that are not in the designated "cpuset.cpus". That will automatically include offlined cpus in parent's "cpuset.cpus".

> 4) There is no child cgroups with cpuset enabled.  This avoids
>    cpu migrations of multiple cgroups simultaneously which can
>    be problematic.
>
> (I've put no. 3 into transition conditions because _after_ the
> transition parent's cpuset.cpus.effective are subtracted the new root's
> cpuset.cpus but I'd like to have something similar as a validity
> condition but I haven't come up with that yet.)
>
> I consider the following situation:
>
> r                       // all cpus 0-7
> `- part1      cpus=0-3  root >partition
>    ` subpart1 cpus=0-1  root >partition
>    ` subpart2 cpus=2-3  root >partition
> `- other      cpus=4-7  // member by default
>
> Both subpart1 and subpart2 are valid partition roots.
> Look at actions listed below (as alternatives, not a sequence):
>
> a) hotplug offlines cpu 3
>    - would part1 still be considered a valid root?
>      - perhaps not
>    - would subpart1 still be considered a valid root?
>      - it could be, but its parent is invalid so no?
>    - would subpart2 still be considered a valid root?
>      - perhaps not
>

They will all be valid roots. They will become invalid only when their effective cpus are empty and there are tasks in the partition.

> b) administrative change writes 0-2 into part1 cpus

That is actually not allowed because of the following code in validate_change():

    static int validate_change(struct cpuset *cur, struct cpuset *trial)
    {
            :
            /* Each of our child cpusets must be a subset of us */
            ret = -EBUSY;
            cpuset_for_each_child(c, css, cur)
                    if (!is_cpuset_subset(c, trial))
                            goto out;

>    - would part1 still be considered a valid root?
>      - yes
>    - would subpart1 still be considered a valid root?
>      - yes
>    - would subpart2 still be considered a valid root?
>      - perhaps not
>
> c) administrative change writes 3-7 into `other` cpus
>    - should this fail or invalidate a root partition part1?
>      - perhaps fail since the same "owner" manages all siblings and
>        should reduce part1 first

Again, this will not be allowed because of the CPU_EXCLUSIVE flag set in part1.

>
> The answers above are just my "natural" responses, the ideal may be
> different. The issue I want to illustrate is that if all the conditions
> are formed as transition conditions only, they can't be used to reason
> about hotplug or config changes (except for cpuset.cpus.partition
> writes).
>
> What would help me with the understanding -- the invalid root partition is defined as
> 1) such a cgroup where no cpus are granted from the top (and thus has to fall back to ancestors)
> or
> 2) such a cgroup where cpus requested in cpuset.cpus can't be fulfilled (i.e. any missing invalidates)?

For a valid partition, "cpuset.cpus.effective" is always a subset of "cpuset.cpus". When "cpuset.cpus.effective" becomes empty and there are tasks in the partition, it becomes invalid and inherits the non-empty cpuset.cpus.effective of the nearest ancestor. The condition that causes "cpuset.cpus.effective" to become empty can be hotplug or changes to "cpuset.cpus".

> Furthermore, another example (motivated by the patch 4/6)
>
> r                       // all cpus 0-7
> `- part1      cpus=0-4  root >partition
>    ` subpart1 cpus=0-1  root >partition
>    ` subpart2 cpus=2-3  root >partition
>    ` task
> `- other      cpus=5-7  // member by default
>
> It's a valid and achievable state (even on v2 since cpuset is a threaded
> controller).
>
> a) cpu 4 is offlined
>    - this should invalidate part1 (and propagate invalidation into
>      subpart1 and subpart2).

That is subject to design. My current thought is to keep part1 as valid but invalidate the child partitions (subpart1 and subpart2).

> b) administrative write 0-3 into part1 cpus
>    - should this invalidate part1 or be rejected?

The result should be the same as (a).

>
> In conclusion, it'd be good to have validity conditions separate from
> transition conditions (since hotplug transition can't be rejected) and
> perhaps treat administrative changes from an ancestor equally as a
> hotplug.

I am trying to make the result of changing "cpuset.cpus" as close to hotplug as possible but there are cases where the "cpuset.cpus" change is prohibited but hotplug can still happen to remove the cpu.

Hope this will help to clarify the current design.

Cheers,
Longman
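The two rules Waiman states in this reply can be replayed against Michal's first example in a user-space model. This is a simplified sketch, not the kernel implementation: `parse_cpu_list`, `write_allowed` and `root_state` are hypothetical names, the subset check only mirrors the quoted `validate_change()` fragment, and the task placement is an assumption of the example.

```python
def parse_cpu_list(text):
    """Parse a CPU list such as "0-3,5" into a set of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def write_allowed(new_cpus, child_cpu_sets):
    """Mirror of the quoted check: each child must stay a subset of us."""
    return all(child <= new_cpus for child in child_cpu_sets)


def root_state(cpus, online, has_tasks):
    """A root turns invalid only when it has tasks but no effective CPUs."""
    effective = cpus & online
    return "root invalid" if (not effective and has_tasks) else "root"


subpart1 = parse_cpu_list("0-1")
subpart2 = parse_cpu_list("2-3")

# (a) hotplug offlines cpu 3: subpart2 keeps cpu 2, so it stays valid.
print(root_state(subpart2, parse_cpu_list("0-2,4-7"), has_tasks=True))  # root
# If cpu 2 goes offline as well, subpart2 has tasks but no CPUs left.
print(root_state(subpart2, parse_cpu_list("0-1,4-7"), has_tasks=True))  # root invalid

# (b) writing 0-2 into part1 is rejected: subpart2 (2-3) is not a subset.
print(write_allowed(parse_cpu_list("0-2"), [subpart1, subpart2]))       # False
```

The asymmetry noted at the end of the reply shows up directly: the administrative write in (b) is refused, while hotplug removing the same CPU in (a) cannot be refused and only changes the resulting state.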
On 10/13/21 5:45 PM, Waiman Long wrote:
>> In conclusion, it'd be good to have validity conditions separate from
>> transition conditions (since hotplug transition can't be rejected) and
>> perhaps treat administrative changes from an ancestor equally as a
>> hotplug.
>
> I am trying to make the result of changing "cpuset.cpus" as close to
> hotplug as possible but there are cases where the "cpuset.cpus" change
> is prohibited but hotplug can still happen to remove the cpu.
>
> Hope this will help to clarify the current design.
>

BTW, the attached file is the current draft of the cpuset.cpus.partition document.

Cheers,
Longman

  cpuset.cpus.partition
        A read-write single value file which exists on non-root
        cpuset-enabled cgroups.  This flag is owned by the parent cgroup
        and is not delegatable.

        It accepts only the following input values when written to.

          ==========    =====================================
          "member"      Non-root member of a partition
          "root"        Partition root
          "isolated"    Partition root without load balancing
          ==========    =====================================

        When set to be a partition root, the current cgroup is the
        root of a new partition or scheduling domain that comprises
        itself and all its descendants except those that are separate
        partition roots themselves and their descendants.  The root
        cgroup is always a partition root.

        When set to "isolated", the CPUs in that partition root will
        be in an isolated state without any load balancing from the
        scheduler.  Tasks in such a partition must be explicitly bound
        to each individual CPU.

        "cpuset.cpus" must always be set up first before enabling
        partition.  Unlike "member" whose "cpuset.cpus.effective" can
        contain CPUs not in "cpuset.cpus", this can never happen with a
        valid partition root.  In other words, "cpuset.cpus.effective"
        is always a subset of "cpuset.cpus" for a valid partition root.

        When a parent partition root cannot exclusively grant any of
        the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective"
        becomes empty.  If there are tasks in the partition root, the
        partition root becomes invalid and "cpuset.cpus.effective"
        is reset to that of the nearest non-empty ancestor.

        Note that a task cannot be moved to a cgroup with empty
        "cpuset.cpus.effective".

        There are additional constraints on where a partition root can
        be enabled ("root" or "isolated").  It can only be enabled in
        a cgroup if all the following conditions are met.

        1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
           not shared by any of its siblings.
        2) The parent cgroup is a valid partition root.
        3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus".
        4) There are no child cgroups with cpuset enabled.  This avoids
           cpu migrations of multiple cgroups simultaneously which can
           be problematic.

        On read, the "cpuset.cpus.partition" file can show the following
        values.

          =========================    =====================================
          "member"                     Non-root member of a partition
          "root"                       Partition root
          "isolated"                   Partition root without load balancing
          "root invalid (<reason>)"    Invalid partition root
          =========================    =====================================

        In the case of an invalid partition root, a descriptive string on
        why the partition is invalid is included within parentheses.

        Once becoming a partition root, changes to "cpuset.cpus" are
        generally allowed as long as the cpu list is exclusive and is
        a superset of children's cpu lists.

        The constraints of a valid partition root are as follows:

        1) "cpuset.cpus" is non-empty and exclusive.
        2) The parent cgroup is a valid partition root.
        3) "cpuset.cpus.effective" is a subset of "cpuset.cpus".
        4) "cpuset.cpus.effective" is non-empty when there are tasks
           in the partition.

        Changes to "cpuset.cpus" or cpu hotplug may cause the state
        of a valid partition root to become invalid when one or more
        constraints of a valid partition root are violated.  Therefore,
        user space agents that manage partition roots should avoid
        unnecessary changes to "cpuset.cpus" and always check the state
        of "cpuset.cpus.partition" after making changes to make sure
        that the partitions are functioning properly as expected.

        Changing a partition root to "member" is always allowed.
        If there are child partition roots underneath it, however,
        they will be forced to be switched back to "member" too and
        lose their partitions.  So care must be taken to double check
        for this condition before disabling a partition root.

        Setting a cgroup to a valid partition root will take the CPUs
        away from the effective CPUs of the parent partition.

        A valid parent partition may distribute out all its CPUs to
        its child partitions as long as it is not the root cgroup as
        we need some house-keeping CPUs in the root cgroup.

        An invalid partition is not a real partition even though some
        internal states may still be kept.

        An invalid partition root can be reverted back to a real
        partition root if none of the constraints of a valid partition
        root are violated.

        Poll and inotify events are triggered whenever the state of
        "cpuset.cpus.partition" changes.  That includes changes caused by
        write to "cpuset.cpus.partition", cpu hotplug and other changes
        that make the partition invalid.  This will allow user space
        agents to monitor unexpected changes to "cpuset.cpus.partition"
        without the need to do continuous polling.
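Since the draft asks user space agents to always check "cpuset.cpus.partition" after making changes, a small parser for the read-back values may be useful. This is a sketch against the draft text only: the function name is hypothetical, accepting an "isolated invalid" prefix and a missing reason are assumptions, and the reason strings in the usage lines are made up for illustration.

```python
import re

# "root invalid (<reason>)" follows the draft table above.  Accepting an
# "isolated invalid" prefix and an absent reason is an assumption here.
_INVALID = re.compile(r"^(root|isolated) invalid(?: \((.*)\))?$")


def parse_partition_state(text):
    """Return (partition_type, is_valid, reason) for a value read back.

    "member" is reported with is_valid=True, meaning "not an invalid root".
    """
    value = text.strip()
    if value in ("member", "root", "isolated"):
        return value, True, None
    match = _INVALID.match(value)
    if match:
        return match.group(1), False, match.group(2)
    raise ValueError(f"unrecognized partition state: {value!r}")


print(parse_partition_state("root\n"))
# ('root', True, None)
print(parse_partition_state("root invalid (example reason)\n"))
# ('root', False, 'example reason')
```

An agent would typically write the desired value, read the file back, pass the contents through a parser like this one, and log or act on the reason when the partition is not valid.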
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index babbe04c8d37..e759b0898bce 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2091,8 +2091,9 @@ Cpuset Interface Files
 	It accepts only the following input values when written to.
 
 	  ========	================================
-	  "root"	a partition root
-	  "member"	a non-root member of a partition
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
 	  ========	================================
 
 	When set to be a partition root, the current cgroup is the
@@ -2101,64 +2102,77 @@ Cpuset Interface Files
 	partition roots themselves and their descendants. The root
 	cgroup is always a partition root.
 
-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler. Tasks in such a partition must be explicitly bound
+	to each individual CPU.
+
+	There are constraints on where a partition root can be set
+	("root" or "isolated"). It can only be set in a cgroup if all
+	the following conditions are true.
 
 	1) The "cpuset.cpus" is not empty and the list of CPUs are
 	   exclusive, i.e. they are not shared by any of its siblings.
 	2) The parent cgroup is a partition root.
-	3) The "cpuset.cpus" is also a proper subset of the parent's
+	3) The "cpuset.cpus" is a subset of the parent's
 	   "cpuset.cpus.effective".
 	4) There is no child cgroups with cpuset enabled. This is for
 	   eliminating corner cases that have to be handled if such a
 	   condition is allowed.
 
-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup. Once it is set, this
-	file cannot be reverted back to "member" if there are any child
+	Setting it to a partition root will take the CPUs away from the
+	effective CPUs of the parent cgroup. Once it is set, this file
+	should not be reverted back to "member" if there are any child
 	cgroups with cpuset enabled.
 
-	A parent partition cannot distribute all its CPUs to its
-	child partitions. There must be at least one cpu left in the
-	parent partition.
-
-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
-
-	Sometimes, external factors like changes to ancestors'
-	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change. On read, the "cpuset.sched.partition" file
-	can show the following values.
-
-	  ==============	==============================
-	  "member"	Non-root member of a partition
-	  "root"	Partition root
-	  "root invalid"	Invalid partition root
-	  ==============	==============================
-
-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
-
-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself. In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
-	The cpu affinity of all the tasks in the cgroup will then be
-	associated with CPUs in the nearest ancestor partition.
-
-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent. In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	A parent partition may distribute all its CPUs to its child
+	partitions as long as it is not the root cgroup.
+
+	Once becoming a partition root, changes to "cpuset.cpus"
+	is generally allowed as long as the first condition above
+	(cpu exclusivity rule) is true.
+
+	Sometimes, changes to "cpuset.cpus" or cpu hotplug may cause
+	the state of the partition root to become invalid when the
+	other constraints of partition root are violated. Therefore,
+	user space agents that manage partition roots should avoid
+	unnecessary changes to "cpuset.cpus" and monitor the state of
+	"cpuset.cpus.partition" to make sure that the partitions are
+	functioning as expected.
+
+	On read, the "cpuset.cpus.partition" file can show the following
+	values.
+
+	  ======================	==============================
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
+	  "root invalid (<reason>)"	Invalid partition root
+	  ======================	==============================
+
+	A partition root becomes invalid if all the CPUs requested in
+	"cpuset.cpus" become unavailable. This can happen if all the
+	CPUs have been offlined, or the state of an ancestor partition
+	root become invalid. "<reason>" is a string that describes why
+	the partition becomes invalid.
+
+	An invalid partition is not a real partition even though some
+	internal states may still be kept. The cpu affinity of all
+	the tasks in the cgroup will then be associated with CPUs in
+	the nearest ancestor partition.
+
+	An invalid partition root can be reverted back to a real
+	partition root if at least one of the requested CPUs become
+	available again. In this case, the cpu affinity of all the
+	tasks in the formerly invalid partition will be associated to
+	the CPUs of the newly formed partition.
+
+	Poll and inotify events are triggered whenever the state of
+	"cpuset.cpus.partition" changes. That includes changes caused by
+	write to "cpuset.cpus.partition", cpu hotplug and other changes
+	that make the partition invalid. This will allow user space
+	agents to monitor unexpected changes to "cpuset.cpus.partition"
+	without the need to do continuous polling.
 
 
 Device controller
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as the ability to create
non-top cpuset partition with no cpu allocated to it.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 112 +++++++++++++-----------
 1 file changed, 63 insertions(+), 49 deletions(-)
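
For readers who want to see conditions 1) to 4) of the patched text
spelled out, the following Python sketch checks them in user space
against CPU sets an agent has already read. It is only an illustration
under stated assumptions: parse_cpulist() and enable_violations() are
hypothetical names, the kernel performs its own checks, and a result
computed this way is advisory at best because CPUs can go offline
between the check and the write to "cpuset.cpus.partition".

```python
from typing import Iterable, List, Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a CPU list such as "0-3,5" into a set of CPU numbers."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue  # an empty file yields an empty set
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def enable_violations(
    cpus: Set[int],
    sibling_cpus: Iterable[Set[int]],
    parent_is_partition_root: bool,
    parent_effective: Set[int],
    has_cpuset_children: bool,
) -> List[str]:
    """Return the conditions that would block "root" or "isolated"."""
    problems: List[str] = []
    if not cpus:
        problems.append('1) "cpuset.cpus" is empty')
    elif any(cpus & other for other in sibling_cpus):
        problems.append('1) "cpuset.cpus" is shared with a sibling')
    if not parent_is_partition_root:
        problems.append("2) parent cgroup is not a partition root")
    if not cpus <= parent_effective:
        problems.append('3) "cpuset.cpus" is not a subset of the parent\'s '
                        '"cpuset.cpus.effective"')
    if has_cpuset_children:
        problems.append("4) child cgroups with cpuset enabled exist")
    return problems
```

An empty result means none of the four conditions is violated for the
values passed in; it does not guarantee that the subsequent write
succeeds or that the partition stays valid afterwards.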