Message ID | 20241029101507.7188-3-patryk.wlazlyn@linux.intel.com (mailing list archive) |
---|---|
State | Changes Requested, archived |
Headers | show |
Series | SRF: Fix offline CPU preventing pc6 entry | expand |
On 10/29/24 03:15, Patryk Wlazlyn wrote: > +void smp_set_mwait_play_dead_hint(unsigned int hint) > +{ > + WRITE_ONCE(play_dead_mwait_hint, hint); > +} This all feels a bit hacky and unstructured to me. Could we at least set up a few rules here? Like, say what the hints are, what values can they have? Where do they come from? Can this get called more than once? Does it _need_ to be set? What's the behavior when it is not set? Who is responsible for calling this? What good does the smp_ prefix do? I don't think _callers_ care whether this is getting optimized out or not. > - hint = get_deepest_mwait_hint(); > + hint = READ_ONCE(play_dead_mwait_hint); > + if (hint == PLAY_DEAD_MWAIT_HINT_UNSET) > + hint = get_deepest_mwait_hint(); This is also rather opaque. Why are there two hints? What makes one better than the other one?
On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote: > On 10/29/24 03:15, Patryk Wlazlyn wrote: > > +void smp_set_mwait_play_dead_hint(unsigned int hint) > > +{ > > + WRITE_ONCE(play_dead_mwait_hint, hint); > > +} > > This all feels a bit hacky and unstructured to me. > > Could we at least set up a few rules here? Like, say what the hints > are, what values can they have? Where do they come from? Can this get > called more than once? Does it _need_ to be set? What's the behavior > when it is not set? Who is responsible for calling this? > > What good does the smp_ prefix do? I don't think _callers_ care whether > this is getting optimized out or not. > The goal of 'get_deepest_mwait_hint()' is to find the mwait hint of the deepest available C-state, in order to request it for the offline CPU. On Intel CPUs, the C-states and their mwait hint values are platform-specific. Generally, there is no architectural way for enumerating mwait hints on Intel CPUs. In the idle path (different to the CPU offline path), idle drivers (if enabled) enumerate and request C-states using either ACPI mechanisms or a compiled-in, per-platform custom C-states table, provided by Intel for specific platforms. In the CPU offline path, only the deepest C-state hint is needed. Historically, it was determined using a simple algorithm, which happened to provide the correct result on most Intel platforms. This algorithm is based on scanning CPUID leaf 5 EDX bits and building the hint value from the C-state and sub-state numbers. Generally speaking, mwait hints are opaque numbers, and the algorithm is not architectural. While it produces the correct results for most Intel CPUs, it produces sub-optimal result for some CPUs. For example Intel Sierra Forest Xeon CPU, the algorithm produces hint 0x21, while the actual deepest C-state hint is 0x23. If hint 0x21 is used, the result is that the offline CPU does not enter the deepest available C-state. While this is not fatal, the CPU ends up saving less energy than it could have saved. The 'set_mwait_play_dead_hint()' function provides a mechanism for defining the mwait hint for the offline CPU, and can be used for platforms where the generic non-architectural algorithm provides a sub-optimal result. Q&A. 1. Could we at least set up a few rules here? Like, say what the hints are, what values can they have? The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits define the state. The state value (higher 4 bits) correspond to the state enumerated by CPUID leaf 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number. The hint is provided to the mwait instruction via EAX. 2. Where do they come from? Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6, PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific platform defines the hint values. Intel typically provides the hint values in the EDS (External Design Specification) document. It is typically non-public. Intel also discloses the hint values for open-source projects like Linux, and then Intel engineers submit them to the intel_idle driver. Some of the hints may also be found via ACPI _CST table. 3. Can this get called more than once? It is not supposed to. The idea is that if a driver like intel_idle is used, it can call 'set_mwait_play_dead_hint()' and provide the most optimal hint number for the offline code. 4. Does it _need_ to be set? No. It is more of an optimization. But it is an important optimization which may result in saving a lot of money in a datacenter. Typically using a "wrong" hint value is non-fatal, at least I did not see it being fatal so far. The CPU will map it to some hardware C-state request, but which one - depends on the "wrong" value and the CPU. It just may be sub- optimal. 5. What's the behavior when it is not set? The offline code will fall-back to the generic non-architectural algorithm, which provides correct results for all server platforms I dealt with since 2017. It should provide the correct hint for most client platforms, as far as I am aware. Sierra Forest Xeon is the first platform where the generic algorithm provides a sub-optimal value 0x21. It is not fatal, just sub-optimal. Note: I am working with Intel firmware team on having the FW "re-mapping" hint 0x21 to hint 0x23, so that "unaware" Linux kernel also ends up with requesting the deepest C-state for an offline CPU. 6. Who is responsible for calling this? The idea for now is that the intel_idle driver calls it. But in theory, in the future, any driver/platform code may call it if it "knows" what's the most optimal hint, I suppose. I do not have a good example though. Artem.
>> +void smp_set_mwait_play_dead_hint(unsigned int hint) >> +{ >> + WRITE_ONCE(play_dead_mwait_hint, hint); >> +} > > This all feels a bit hacky and unstructured to me. > > Could we at least set up a few rules here? Like, say what the hints > are, what values can they have? Where do they come from? Can this get > called more than once? Does it _need_ to be set? What's the behavior > when it is not set? Who is responsible for calling this? The other idea is to first check if currently loaded idle driver provides enter_dead() callback first and leave the current, deepest mwait hint computation code as a fallback. Does that sound less hacky? Unfortunately, it comes with a little problem. In case of kexec, we need to have a way to exit from the mwait loop and enter hlt to prevent offlined CPU from crashing when the old memory is being overwritten. I think, we can solve it by bringing the CPU back online before we proceed with kexec, but I would appreciate some feedback from someone who is more familiar with kexec, before merging that. We may also signal that by touching the resched flag on which enter_dead() code will monitor in case of mwait and enter hlt right after, but that's a bit hackier IMO. > What good does the smp_ prefix do? I don't think _callers_ care whether > this is getting optimized out or not. The prefix makes it a little bit cleaner by not exporting new global symbol with "set_mwait_play_dead_hint" name.
On 10/30/24 02:58, Artem Bityutskiy wrote: > On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote: > 1. Could we at least set up a few rules here? Like, say what the hints > are, what values can they have? > > The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits > define the state. > > The state value (higher 4 bits) correspond to the state enumerated by CPUID leaf > 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number. > > The hint is provided to the mwait instruction via EAX. OK, so can you distill that down to something succinct and get it in a comment above the new function, please? > 2. Where do they come from? > > Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6, > PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific > platform defines the hint values. > > Intel typically provides the hint values in the EDS (External Design > Specification) document. It is typically non-public. > > Intel also discloses the hint values for open-source projects like Linux, and > then Intel engineers submit them to the intel_idle driver. > > Some of the hints may also be found via ACPI _CST table. What about the mwait_play_dead() loop that calculates the hint? Doesn't that derive the hint from CPUID? > 3. Can this get called more than once? > > It is not supposed to. The idea is that if a driver like intel_idle is used, it > can call 'set_mwait_play_dead_hint()' and provide the most optimal hint number > for the offline code. There are two important nuggets in there: First, an idle driver can but is not required to set the hint. This would be good comment material. Second, only one thing is supposed to set the hint. This is a good thing to WARN() about. > 4. Does it _need_ to be set? > > No. It is more of an optimization. But it is an important optimization which may > result in saving a lot of money in a datacenter. > > Typically using a "wrong" hint value is non-fatal, at least I did not see it > being fatal so far. The CPU will map it to some hardware C-state request, but > which one - depends on the "wrong" value and the CPU. It just may be sub- > optimal. OK, so this tells me we *don't* want some kind of: WARN_ON(play_dead_mwait_hint == PLAY_DEAD_MWAIT_HINT_UNSET); warning. > 5. What's the behavior when it is not set? > > The offline code will fall-back to the generic non-architectural algorithm, > which provides correct results for all server platforms I dealt with since 2017. > It should provide the correct hint for most client platforms, as far as I am > aware. > > Sierra Forest Xeon is the first platform where the generic algorithm provides a > sub-optimal value 0x21. It is not fatal, just sub-optimal. What is the non-architectural algorithm? Which Linux code are you referring to? > Note: I am working with Intel firmware team on having the FW "re-mapping" hint > 0x21 to hint 0x23, so that "unaware" Linux kernel also ends up with requesting > the deepest C-state for an offline CPU. That would be great as well. Thanks for doing that! > 6. Who is responsible for calling this? > > The idea for now is that the intel_idle driver calls it. > > But in theory, in the future, any driver/platform code may call it if it "knows" > what's the most optimal hint, I suppose. I do not have a good example though. So let's look at how this works: void native_play_dead(void) { ... mwait_play_dead(); if (cpuidle_play_dead()) hlt_play_dead(); } This _existing_ code has three different ways of playing dead (in this order of preference): 1. mwait 2. cpuidle 3. hlt It has (at least) two different mechanisms for telling which of these to call: 1. mwait has a bunch of built-in logic that will ensure the CPU should use for playing dead. If not, it does nothing and returns. 2. cpuidle_play_dead() (only used by acpi_idle_driver as far as I can tell) will return an error if it does not support playing dead If 1 and 2 fail, then hlt_play_dead() gets called. But the really fun part of this is that idle driver is *called* here. The driver that is also responsible for overriding the mwait hint. So this series opts to have the boot code plumb the hint back into a basically undocumented global variable while also assuming that the system is *going* to use mwait. It then does *nothing* with the callback just adjacent to the code it wants to modify. Seems rather spaghetti-like to me. To make it worse, go look at da6fa7ef67f0 ("x86/smpboot: Don't use mwait_play_dead() on AMD systems"). It hacks AMD-specific code in mwait_play_dead() just to force the cpuidle code to get called. What if we did this? First, introduce a helper: bool mwait_play_dead_with_hint(u32 hint) and then restructure native_play_dead() to look like this: static mwait_play_dead_generic(void) { u32 hint = get_deepest_mwait_hint(); return mwait_play_dead_with_hint(hint); } void native_play_dead(void) { bool used; used = cpuidle_play_dead(); if (used) return; used = mwait_play_dead_generic(); if (used) return; hlt_play_dead(); } If the cpuidle drivers want to use mwait with a different hint, they override the *EXISTING* drv->states[].enter_dead() functionality and call mwait_play_dead_with_hint() with their new hint. Then they don't need to pass anything _over_ to the mwait code. Wouldn't something like that makes this all much more straightforward?
On Wed, Oct 30, 2024 at 8:32 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 10/30/24 02:58, Artem Bityutskiy wrote: > > On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote: > > 1. Could we at least set up a few rules here? Like, say what the hints > > are, what values can they have? > > > > The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits > > define the state. > > > > The state value (higher 4 bits) correspond to the state enumerated by CPUID leaf > > 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number. > > > > The hint is provided to the mwait instruction via EAX. > > OK, so can you distill that down to something succinct and get it in a > comment above the new function, please? > > > 2. Where do they come from? > > > > Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6, > > PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific > > platform defines the hint values. > > > > Intel typically provides the hint values in the EDS (External Design > > Specification) document. It is typically non-public. > > > > Intel also discloses the hint values for open-source projects like Linux, and > > then Intel engineers submit them to the intel_idle driver. > > > > Some of the hints may also be found via ACPI _CST table. > > What about the mwait_play_dead() loop that calculates the hint? Doesn't > that derive the hint from CPUID? > > > 3. Can this get called more than once? > > > > It is not supposed to. The idea is that if a driver like intel_idle is used, it > > can call 'set_mwait_play_dead_hint()' and provide the most optimal hint number > > for the offline code. > > There are two important nuggets in there: > > First, an idle driver can but is not required to set the hint. This > would be good comment material. > > Second, only one thing is supposed to set the hint. This is a good > thing to WARN() about. > > > 4. Does it _need_ to be set? > > > > No. It is more of an optimization. But it is an important optimization which may > > result in saving a lot of money in a datacenter. > > > > Typically using a "wrong" hint value is non-fatal, at least I did not see it > > being fatal so far. The CPU will map it to some hardware C-state request, but > > which one - depends on the "wrong" value and the CPU. It just may be sub- > > optimal. > > OK, so this tells me we *don't* want some kind of: > > WARN_ON(play_dead_mwait_hint == PLAY_DEAD_MWAIT_HINT_UNSET); > > warning. > > > 5. What's the behavior when it is not set? > > > > The offline code will fall-back to the generic non-architectural algorithm, > > which provides correct results for all server platforms I dealt with since 2017. > > It should provide the correct hint for most client platforms, as far as I am > > aware. > > > > Sierra Forest Xeon is the first platform where the generic algorithm provides a > > sub-optimal value 0x21. It is not fatal, just sub-optimal. > > What is the non-architectural algorithm? Which Linux code are you > referring to? > > > Note: I am working with Intel firmware team on having the FW "re-mapping" hint > > 0x21 to hint 0x23, so that "unaware" Linux kernel also ends up with requesting > > the deepest C-state for an offline CPU. > > That would be great as well. Thanks for doing that! > > > 6. Who is responsible for calling this? > > > > The idea for now is that the intel_idle driver calls it. > > > > But in theory, in the future, any driver/platform code may call it if it "knows" > > what's the most optimal hint, I suppose. I do not have a good example though. > > So let's look at how this works: > > void native_play_dead(void) > { > ... > mwait_play_dead(); > if (cpuidle_play_dead()) > hlt_play_dead(); > } > > This _existing_ code has three different ways of playing dead (in this > order of preference): > > 1. mwait > 2. cpuidle > 3. hlt > > It has (at least) two different mechanisms for telling which of these to > call: > > 1. mwait has a bunch of built-in logic that will ensure the CPU > should use for playing dead. If not, it does nothing and returns. > 2. cpuidle_play_dead() (only used by acpi_idle_driver as far as I can > tell) will return an error if it does not support playing dead > > If 1 and 2 fail, then hlt_play_dead() gets called. > > But the really fun part of this is that idle driver is *called* here. Currently, cpuidle_play_dead() is for the cases when MWAIT is not supported. > The driver that is also responsible for overriding the mwait hint. So no, intel_idle is not called there because it only uses MWAIT. > So this series opts to have the boot code plumb the hint back into a > basically undocumented global variable while also assuming that the > system is *going* to use mwait. It then does *nothing* with the > callback just adjacent to the code it wants to modify. > > Seems rather spaghetti-like to me. > > To make it worse, go look at da6fa7ef67f0 ("x86/smpboot: Don't use > mwait_play_dead() on AMD systems"). It hacks AMD-specific code in > mwait_play_dead() just to force the cpuidle code to get called. > > What if we did this? First, introduce a helper: > > bool mwait_play_dead_with_hint(u32 hint) > > and then restructure native_play_dead() to look like this: > > static mwait_play_dead_generic(void) > { > u32 hint = get_deepest_mwait_hint(); > > return mwait_play_dead_with_hint(hint); > } > > void native_play_dead(void) > { > bool used; > > used = cpuidle_play_dead(); > if (used) > return; > > used = mwait_play_dead_generic(); > if (used) > return; > > hlt_play_dead(); > } > > If the cpuidle drivers want to use mwait with a different hint, they > override the *EXISTING* drv->states[].enter_dead() functionality and > call mwait_play_dead_with_hint() with their new hint. Then they don't > need to pass anything _over_ to the mwait code. > > Wouldn't something like that makes this all much more straightforward? Well, except for one detail which is this beautiful thing in mwait_play_dead(): if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) { /* * Kexec is about to happen. Don't go back into mwait() as * the kexec kernel might overwrite text and data including * page tables and stack. So mwait() would resume when the * monitor cache line is written to and then the CPU goes * south due to overwritten text, page tables and stack. * * Note: This does _NOT_ protect against a stray MCE, NMI, * SMI. They will resume execution at the instruction * following the HLT instruction and run into the problem * which this is trying to prevent. */ WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT); while(1) native_halt(); clearly referred to as a kexec() hack, which cannot be done in cpuidle_play_dead() because the cpuidle driver doesn't know how to get to md->control. And even if it did, it is kind of not its business to deal with this stuff.
On 10/30/24 12:53, Rafael J. Wysocki wrote: > clearly referred to as a kexec() hack, which cannot be done in > cpuidle_play_dead() because the cpuidle driver doesn't know how to get > to md->control. What if we have an mwait_play_dead() _helper_? It takes the hint as an argument and retains all the kexec hacks. All the cpuidle driver has to do is call the helper with the hint that the cpuidle driver determines.
On Wed, Oct 30, 2024 at 9:11 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 10/30/24 12:53, Rafael J. Wysocki wrote: > > clearly referred to as a kexec() hack, which cannot be done in > > cpuidle_play_dead() because the cpuidle driver doesn't know how to get > > to md->control. > > What if we have an mwait_play_dead() _helper_? It takes the hint as an > argument and retains all the kexec hacks. All the cpuidle driver has to > do is call the helper with the hint that the cpuidle driver determines. The same idea has occurred to me in the meantime, so yes, I think that it would work.
On 10/30/24 06:33, Patryk Wlazlyn wrote: >>> +void smp_set_mwait_play_dead_hint(unsigned int hint) >>> +{ >>> + WRITE_ONCE(play_dead_mwait_hint, hint); >>> +} >> >> This all feels a bit hacky and unstructured to me. >> >> Could we at least set up a few rules here? Like, say what the hints >> are, what values can they have? Where do they come from? Can this get >> called more than once? Does it _need_ to be set? What's the behavior >> when it is not set? Who is responsible for calling this? > > The other idea is to first check if currently loaded idle driver provides > enter_dead() callback first and leave the current, deepest mwait hint > computation code as a fallback. > > Does that sound less hacky? Yes. > Unfortunately, it comes with a little problem. In case of kexec, we need to > have a way to exit from the mwait loop and enter hlt to prevent offlined CPU > from crashing when the old memory is being overwritten. Why is this a problem? Like I mentioning to Rafael, just *call* mwait_play_dead() from the idle driver. mwait_play_dead() could probably also use some refactoring because it has 3 pieces: 1. Should the code run at all or defer to another play dead implementation? 2. Calculating the hint (obviously not needed in your new case) 3. Actually running mwait (including the kexec hack) The "should the code run?" bit is superfluous but harmless if called from the idle driver. Ditto on the hint calculation, but you already factored it out. That leaves the "actually run mwait" bit which you 100% need. > I think, we can solve it by bringing the CPU back online before we proceed > with kexec, but I would appreciate some feedback from someone who is more > familiar with kexec, before merging that. > > We may also signal that by touching the resched flag on which enter_dead() > code will monitor in case of mwait and enter hlt right after, but that's a > bit hackier IMO. That route is also fine with me, but I'm not sure it's necessary. >> What good does the smp_ prefix do? I don't think _callers_ care whether >> this is getting optimized out or not. > > The prefix makes it a little bit cleaner by not exporting new global symbol > with "set_mwait_play_dead_hint" name. I'm not following.
Hi Dave, community, I'll cover general questions, and let Patryk cover the specific core-related comments. On Wed, 2024-10-30 at 12:32 -0700, Dave Hansen wrote: > On 10/30/24 02:58, Artem Bityutskiy wrote: > > On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote: > > 1. Could we at least set up a few rules here? Like, say what the hints > > are, what values can they have? > > > > The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits > > define the state. > > > > The state value (higher 4 bits) correspond to the state enumerated by CPUID > > leaf > > 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque > > number. > > > > The hint is provided to the mwait instruction via EAX. > > OK, so can you distill that down to something succinct and get it in a > comment above the new function, please? Yes, that was my idea to write a long e-mail to cover your and other reviewer's questions, and let Patryk turn this into nice comments at the right places. > > 2. Where do they come from? > > > > Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6, > > PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The > > specific > > platform defines the hint values. > > > > Intel typically provides the hint values in the EDS (External Design > > Specification) document. It is typically non-public. > > > > Intel also discloses the hint values for open-source projects like Linux, > > and > > then Intel engineers submit them to the intel_idle driver. > > > > Some of the hints may also be found via ACPI _CST table. > > What about the mwait_play_dead() loop that calculates the hint? Doesn't > that derive the hint from CPUID? The mwait_play_dead() hint calculation algorithm is the root of the problem. It calculates a sub-optimal hint for some platforms, such as Sierra Forest Xeon. So at high level, mwait_play_dead() calculates the hint, saves it in the 'eax' variable, and then just eeps issuing 'mwait(eax, ecx=0)' with that hint. So the mwait hint calculation code is between lines [1303] and [1313] (see links below). The code basically finds the highest C-state index ('highest_cstate' variable) and count of sub-states within that C-state ('highest_subcstate' variable), and forms the mwait hint out of them. You should also check leaf CPUID Leaf 5 SDM doc for more clarity. Example #1. ========== Sapphire Rapids Xeon, where the algorithm calculates the correct hint. 'highest_cstate' ends up being 2 (corresponds C3 in CPUID leaf 5 terms, note that C0 is ignored), and highest_subcstate ends up being 1 (1 sub-state), so the mwait hint ends up being 0x20. The math is at line [1311]: eax = (highest_cstate << MWAIT_SUBSTATE_SIZE) | (highest_subcstate - 1) = (2 << 4) | (1 - 1) = 0x20 This ends up being the deepest available C-state hint (C6). No problem here. Code lines. 1303: https://elixir.bootlin.com/linux/v6.9/source/arch/x86/kernel/smpboot.c#L1303 1313: https://elixir.bootlin.com/linux/v6.9/source/arch/x86/kernel/smpboot.c#L1313 1311: https://elixir.bootlin.com/linux/v6.9/source/arch/x86/kernel/smpboot.c#L1311 Exemple #2. ========== Sierra Forest Xeon, where the algorithm calculates the incorrect hint. Here is CPUID leaf 5. $ cpuid -1 -l 5 CPU: MONITOR/MWAIT (5): smallest monitor-line size (bytes) = 0x40 (64) largest monitor-line size (bytes) = 0x40 (64) enum of Monitor-MWAIT exts supported = true supports intrs as break-event for MWAIT = true number of C0 sub C-states using MWAIT = 0x0 (0) number of C1 sub C-states using MWAIT = 0x2 (2) number of C2 sub C-states using MWAIT = 0x0 (0) number of C3 sub C-states using MWAIT = 0x2 (2) number of C4 sub C-states using MWAIT = 0x0 (0) number of C5 sub C-states using MWAIT = 0x0 (0) number of C6 sub C-states using MWAIT = 0x0 (0) number of C7 sub C-states using MWAIT = 0x0 (0) Notice that CPUID leaf 5 advertises two C3 sub-states, and with this input the the algorithm ends up with mwait hint 0x21. This mwait hint is not documented as supported on Sierra Forest. The supported Sierra Forest mwait hints are defined in 'intel_idle.c': https://github.com/torvalds/linux/blob/2e1b3cc9d7f790145a80cb705b168f05dab65df2/drivers/idle/intel_idle.c#L1345 They are 0x0, 0x01, 0x22, 0x23. And 0x23 is the deepest C-state (C6SP). But offline code ends up issuing 0x21. The platform maps it to a "shallow version" of core C6, which is not documented and is not supposed to be used. At any rate, it ends up in a situation when offlining a single CPU will prevent the platform from ever reaching PC6 state. In PC6 - happens when all CPUs requested C6SP - all cores are powered off and there are power savings in uncore (caches, interconnects, memory controller, etc). Now, as I mentioned, in parallel I am working with Sierra Forest firmware team to change platform behavior and map hint 0x21 to 0x23 in firmware. As a workaround. > > > 5. What's the behavior when it is not set? > > > > The offline code will fall-back to the generic non-architectural algorithm, > > which provides correct results for all server platforms I dealt with since > > 2017. > > It should provide the correct hint for most client platforms, as far as I am > > aware. > > > > Sierra Forest Xeon is the first platform where the generic algorithm > > provides a > > sub-optimal value 0x21. It is not fatal, just sub-optimal. > > What is the non-architectural algorithm? Which Linux code are you > referring to? See above. The non-architectural part of it is basically this. The algorithm assumes that in the sub-state part of the mwait hint (4 least significant bits) always starts with 0 and then next sub-state is always 1, and there are no gaps. This is not documented in Intel SDM. This happens to work on majority of Intel platforms, but this is not architectural. Thank you, Artem.
On 11/6/24 00:14, Artem Bityutskiy wrote: > The non-architectural part of it is basically this. > > The algorithm assumes that in the sub-state part of the mwait hint (4 least > significant bits) always starts with 0 and then next sub-state is always 1, and > there are no gaps. This is not documented in Intel SDM. This happens to work on > majority of Intel platforms, but this is not architectural. I wouldn't call that a "non-architectural algorithm". It's a bug, plain and simple.
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index ca073f40698f..2cb083a84225 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -114,6 +114,7 @@ void wbinvd_on_cpu(int cpu); int wbinvd_on_all_cpus(void); void smp_kick_mwait_play_dead(void); +void smp_set_mwait_play_dead_hint(unsigned int hint); void native_smp_send_reschedule(int cpu); void native_send_call_func_ipi(const struct cpumask *mask); @@ -164,6 +165,8 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu) { return (struct cpumask *)cpumask_of(0); } + +static inline void smp_set_mwait_play_dead_hint(unsigned int hint) { } #endif /* CONFIG_SMP */ #ifdef CONFIG_DEBUG_NMI_SELFTEST diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 683898e3b20e..08f7b43f3fc3 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -127,6 +127,9 @@ int __read_mostly __max_smt_threads = 1; /* Flag to indicate if a complete sched domain rebuild is required */ bool x86_topology_update; +#define PLAY_DEAD_MWAIT_HINT_UNSET 0U +static unsigned int __read_mostly play_dead_mwait_hint; + int arch_update_cpu_topology(void) { int retval = x86_topology_update; @@ -1270,6 +1273,11 @@ void play_dead_common(void) local_irq_disable(); } +void smp_set_mwait_play_dead_hint(unsigned int hint) +{ + WRITE_ONCE(play_dead_mwait_hint, hint); +} + /* Computes mwait hint for the deepest mwait hint based on cpuid leaf 0x5 */ static inline unsigned int get_deepest_mwait_hint(void) { @@ -1322,7 +1330,9 @@ static inline void mwait_play_dead(void) if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF) return; - hint = get_deepest_mwait_hint(); + hint = READ_ONCE(play_dead_mwait_hint); + if (hint == PLAY_DEAD_MWAIT_HINT_UNSET) + hint = get_deepest_mwait_hint(); /* Set up state for the kexec() hack below */ md->status = CPUDEAD_MWAIT_WAIT;
The current implementation for looking up the mwait hint for the deepest cstate depends on them to be continuous in range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 platforms, it is not architectural and may not result in reaching the most optimized idle state on some of them. For example Intel's Sierra Forest report two C6 substates in cpuid leaf 5: C6S (hint 0x22) C6SP (hint 0x23) Hints 0x20 and 0x21 are skipped entirely, causing the current implementation to compute the wrong hint, when looking for the deepest cstate for offlined CPU to enter. As a result, package with an offlined CPU can never reach PC6. Allow the idle driver to communicate the deepest idle cstate to the x86 offline code. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> --- arch/x86/include/asm/smp.h | 3 +++ arch/x86/kernel/smpboot.c | 12 +++++++++++- 2 files changed, 14 insertions(+), 1 deletion(-)