Message ID | 52B31B21.6010901@zytor.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
On Thu, Dec 19, 2013 at 08:13:21AM -0800, H. Peter Anvin wrote: > How does this look? Completely untested, of course. > > I do wonder if we need more memory barriers, though. > > An alternative would be to move everything into mwait_idle_with_hints(). > > -hpa > > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h > index 7b034a4057f9..6dce588f94b4 100644 > --- a/arch/x86/include/asm/processor.h > +++ b/arch/x86/include/asm/processor.h > @@ -723,6 +723,23 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx) > :: "a" (eax), "c" (ecx)); > } > > +/* > + * Issue a clflush in preparation for a monitor instruction if the CPU > + * needs it. We force the address into the ax register to get a fixed > + * length for the instruction, however, this is what the monitor instruction > + * is going to need anyway, so it shouldn't add any additional code. > + */ > +static inline void clflush_monitor(const void *addr, unsigned long ecx, > + unsigned long edx) > +{ > + alternative_input(ASM_NOP3, > + "clflush (%0)", > + X86_FEATURE_CLFLUSH_MONITOR, > + "a" (addr)); > + __monitor(addr, eax, edx); > + smp_mb(); > +} What's that mb for? Also, can you please merge: http://marc.info/?l=linux-kernel&m=138685838420632 Thomas said he would pick up that series, but seems to have gone missing the past week or so. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/19/2013 08:21 AM, Peter Zijlstra wrote: > > What's that mb for? > It already exists in mwait_idle_with_hints(); I just moved it into this common function. It is a bit odd, I have to admit; it seems like it should be *before* the monitor (and possibly we should have one after the CLFLUSH as well?) > Also, can you please merge: > > http://marc.info/?l=linux-kernel&m=138685838420632 > > Thomas said he would pick up that series, but seems to have gone missing > the past week or so. > We need to get the immediate regression fixed. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* H. Peter Anvin <hpa@zytor.com> wrote: > On 12/19/2013 08:21 AM, Peter Zijlstra wrote: > > > > What's that mb for? > > > > It already exists in mwait_idle_with_hints(); I just moved it into > this common function. It is a bit odd, I have to admit; it seems > like it should be *before* the monitor (and possibly we should have > one after the CLFLUSH as well?) Yes, I think we need a barrier before the CLFLUSH, because according to my reading of the Intel documentation CLFLUSH has no implicit ordering so it might get reordered with the store to ->flags in current_set_polling_and_test(), which might result in spurious wakeup problems again. (And CLFLUSH is a store in a sense, so special in that the regular ordering for stores does not apply.) Likewise, having a barrier before the MONITOR looks sensible as well. Having it _after_ monitor looks weird and is probably wrong. [It might have been the effects of someone seeing the spurious wakeup problems with realizing the true source, or so.] Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: > > * H. Peter Anvin <hpa@zytor.com> wrote: > > > On 12/19/2013 08:21 AM, Peter Zijlstra wrote: > > > > > > What's that mb for? > > > > > > > It already exists in mwait_idle_with_hints(); I just moved it into > > this common function. It is a bit odd, I have to admit; it seems > > like it should be *before* the monitor (and possibly we should have > > one after the CLFLUSH as well?) > > Yes, I think we need a barrier before the CLFLUSH, because according > to my reading of the Intel documentation CLFLUSH has no implicit > ordering so it might get reordered with the store to ->flags in > current_set_polling_and_test(), which might result in spurious wakeup > problems again. No it cannot; since current_set_polling_and_test() already has a barrier to prevent that. Also, the location patched by hpa doesn't actually call that at all. That said, I would find it very strange indeed if a CLFLUSH doesn't also flush the store buffer. > (And CLFLUSH is a store in a sense, so special in that the regular > ordering for stores does not apply.) > > Likewise, having a barrier before the MONITOR looks sensible as well. > Having it _after_ monitor looks weird and is probably wrong. [It might > have been the effects of someone seeing the spurious wakeup problems > with realizing the true source, or so.] I again have to disagree, one would expect monitor to flush all that is required to start the monitor -- and it actually does so. As is testified by this extra CLFLUSH being called a bug workaround. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: > That said, I would find it very strange indeed if a CLFLUSH doesn't also > flush the store buffer. OK, it explicitly states it does not do that and you indeed need an mfence before the clflush. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: > > Likewise, having a barrier before the MONITOR looks sensible as well. > > I again have to disagree, one would expect monitor to flush all that is > required to start the monitor -- and it actually does so. As is > testified by this extra CLFLUSH being called a bug workaround. SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot pass a previous STORE to the same address. That said; there's enough holes in there to swim a titanic through, seeing how MONITOR stares at an entire cacheline and LOAD/STORE order is specified on location, whatever that means. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/19/2013 09:36 AM, Peter Zijlstra wrote: > On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: >> That said, I would find it very strange indeed if a CLFLUSH doesn't also >> flush the store buffer. > > OK, it explicitly states it does not do that and you indeed need an > mfence before the clflush. > So, MONITOR is defined to be ordered as a load, which I think should be adequate, but I really wonder if we should have mfence on both sides of clflush. This now is up to 9 bytes, and perhaps pushing it a bit with how much we would be willing to patch out. On the other hand - the CLFLUSH seems to have worked well enough by itself, and this is all probabilistic anyway, so perhaps we should just leave the naked CLFLUSH in and not worry about it unless measurements say otherwise? -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/19/2013 09:07 AM, Ingo Molnar wrote: > > Likewise, having a barrier before the MONITOR looks sensible as well. > Having it _after_ monitor looks weird and is probably wrong. [It might > have been the effects of someone seeing the spurious wakeup problems > with realizing the true source, or so.] > Does anyone know the history of this barrier after the monitor? I know Len is looking for a minimal patchset that can go into -stable, and it seems prudent to not preturb the code more than necessary, but going forward it would be nice to know... -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: > > > > * H. Peter Anvin <hpa@zytor.com> wrote: > > > > > On 12/19/2013 08:21 AM, Peter Zijlstra wrote: > > > > > > > > What's that mb for? > > > > > > > > > > It already exists in mwait_idle_with_hints(); I just moved it into > > > this common function. It is a bit odd, I have to admit; it seems > > > like it should be *before* the monitor (and possibly we should have > > > one after the CLFLUSH as well?) > > > > Yes, I think we need a barrier before the CLFLUSH, because according > > to my reading of the Intel documentation CLFLUSH has no implicit > > ordering so it might get reordered with the store to ->flags in > > current_set_polling_and_test(), which might result in spurious wakeup > > problems again. > > No it cannot; since current_set_polling_and_test() already has a > barrier to prevent that. See below: > Also, the location patched by hpa doesn't actually call that at all. > > That said, I would find it very strange indeed if a CLFLUSH doesn't > also flush the store buffer. So, the Intel documentation says (sorry about the lazy-link): http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc31.htm "CLFLUSH is only ordered by the MFENCE instruction. It is not guaranteed to be ordered by any other fencing, serializing or other CLFLUSH instruction. For example, software can use an MFENCE instruction to insure that previous stores are included in the write-back." So a specific MFENCE barrier is needed. Also note that this wording excludes implicit serialization such as LOCK prefix or XCHG barriers. As it happens current_set_polling_and_test() uses smp_mb(), which happens to map to MFENCE on all CPUs that can do CLFLUSH, but that's really just an accident and in no way engineered. _At minimum_ we need a prominent comment at the clflush usage site that we rely on the MFENCE in current_set_polling_and_test() ... > > (And CLFLUSH is a store in a sense, so special in that the regular > > ordering for stores does not apply.) > > > > Likewise, having a barrier before the MONITOR looks sensible as > > well. Having it _after_ monitor looks weird and is probably wrong. > > [It might have been the effects of someone seeing the spurious > > wakeup problems with realizing the true source, or so.] > > I again have to disagree, one would expect monitor to flush all that > is required to start the monitor -- and it actually does so. As is > testified by this extra CLFLUSH being called a bug workaround. This assumption would be safer - although AFAICS the Intel MONITOR/MWAIT documentation is quiet about this aspect. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* H. Peter Anvin <hpa@zytor.com> wrote: > On 12/19/2013 09:36 AM, Peter Zijlstra wrote: > > On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: > >> That said, I would find it very strange indeed if a CLFLUSH doesn't also > >> flush the store buffer. > > > > OK, it explicitly states it does not do that and you indeed need > > an mfence before the clflush. > > So, MONITOR is defined to be ordered as a load, which I think should > be adequate, but I really wonder if we should have mfence on both > sides of clflush. This now is up to 9 bytes, and perhaps pushing it > a bit with how much we would be willing to patch out. > > On the other hand - the CLFLUSH seems to have worked well enough by > itself, and this is all probabilistic anyway, so perhaps we should > just leave the naked CLFLUSH in and not worry about it unless > measurements say otherwise? So I think the window of breakage was rather large here, and since it seems to trigger on rare types of hardware I think we'd be better off by erring on the side of robustness this time around ... This is the 'go to idle' path, which isn't as time-critical as the 'get out of idle' code path. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: > > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: > > > Likewise, having a barrier before the MONITOR looks sensible as well. > > > > I again have to disagree, one would expect monitor to flush all that is > > required to start the monitor -- and it actually does so. As is > > testified by this extra CLFLUSH being called a bug workaround. > > SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot > pass a previous STORE to the same address. Yes ... but you could argue that CLFLUSH is neither a load nor a store, it's a _cache sync_ operation, with its special ordering properties. > That said; there's enough holes in there to swim a titanic through, > seeing how MONITOR stares at an entire cacheline and LOAD/STORE > order is specified on location, whatever that means. I think assuming that MONITOR is ordered as a load or better is a pretty safe one (and in fact the Intel documentation seems to say so) - I'd say MONITOR is in micro-code and essentially snoops on cache events on that specific cache line, and loads the cache line on a snoop hit? Btw., what state is the cache line after a MONITOR instruction, is it loaded as shared, or as excusive? (exclusive would probably be better for performance.) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/19/2013 10:09 AM, H. Peter Anvin wrote: > On 12/19/2013 09:07 AM, Ingo Molnar wrote: >> >> Likewise, having a barrier before the MONITOR looks sensible as well. >> Having it _after_ monitor looks weird and is probably wrong. [It might >> have been the effects of someone seeing the spurious wakeup problems >> with realizing the true source, or so.] >> > > Does anyone know the history of this barrier after the monitor? I know > Len is looking for a minimal patchset that can go into -stable, and it > seems prudent to not preturb the code more than necessary, but going > forward it would be nice to know... > Hmm... it *looks* like it is intended to be part of the construct: smp_mb(); if (!need_resched()) ... I found a note in the HLT variant of the function saying: /* * TS_POLLING-cleared state must be visible before we * test NEED_RESCHED: */ ... which presumably has been copied elsewhere. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* H. Peter Anvin <hpa@zytor.com> wrote: > On 12/19/2013 09:07 AM, Ingo Molnar wrote: > > > > Likewise, having a barrier before the MONITOR looks sensible as > > well. Having it _after_ monitor looks weird and is probably wrong. > > [It might have been the effects of someone seeing the spurious > > wakeup problems with realizing the true source, or so.] > > Does anyone know the history of this barrier after the monitor? I > know Len is looking for a minimal patchset that can go into -stable, > and it seems prudent to not preturb the code more than necessary, > but going forward it would be nice to know... For the minimal fix I don't think we should change it - but for v3.14 it looks like a speedup for the from-idle codepath, which is performance sensitive. ( It would also be nice to know whether MONITOR loads that cacheline into the CPUs cache, and in what state it loads it. ) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* H. Peter Anvin <hpa@zytor.com> wrote: > On 12/19/2013 10:09 AM, H. Peter Anvin wrote: > > On 12/19/2013 09:07 AM, Ingo Molnar wrote: > >> > >> Likewise, having a barrier before the MONITOR looks sensible as well. > >> Having it _after_ monitor looks weird and is probably wrong. [It might > >> have been the effects of someone seeing the spurious wakeup problems > >> with realizing the true source, or so.] > >> > > > > Does anyone know the history of this barrier after the monitor? I know > > Len is looking for a minimal patchset that can go into -stable, and it > > seems prudent to not preturb the code more than necessary, but going > > forward it would be nice to know... > > > > Hmm... it *looks* like it is intended to be part of the construct: > > smp_mb(); > if (!need_resched()) > ... > > I found a note in the HLT variant of the function saying: > > /* > * TS_POLLING-cleared state must be visible before we > * test NEED_RESCHED: > */ Yes, that makes sense: the need_resched test is a load, and MONITOR is a load as well. Can the two ever cross, or does the CPU guarantee that because it's the same address, the loads don't cross? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Linus Torvalds <torvalds@linux-foundation.org> wrote: > The x86 memory rules are that two loads always execute in order (ie > rmb is a no-op). > > So I see no reason for a memory barrier after the monitor. [...] Yes, I'm leaning towards that interpretation as well, but the reason I'm a bit catious is the somewhat curious (to me!) wording of the MONITOR instruction: Sets up a linear address range to be monitored by hardware and activates the monitor. ... The MONITOR instruction is ordered as a load operation with respect to other memory transactions. The instruction can be used at all privilege levels and is subject to all permission checking and faults associated with a byte load. Like a load, the MONITOR instruction sets the A-bit but not the D-bit in page tables. Where apparently the 'range' means 'full cache line surrounding the memory address in question'. We have no other load instructions that operate on such a large 'range' of addresses, and I wanted to make sure it's a true (single byte) load for that specific address. The documentation does not appear to explicitly state that it's a load for that address - only that it's ordered as a load. The reason I'm asking is because 'flags' itself might not be at the beginning of the cache line, as it's in the middle of thread_info: struct thread_info { struct task_struct *task; /* main task structure */ struct exec_domain *exec_domain; /* execution domain */ __u32 flags; /* low level flags */ while 'MONITOR' appears to work on the cache line. So are all addresses within that cache line ordered? Only the specific address given to the instruction itself? Only the first word of the cacheline itself? The documentation is a bit vague, at least in my reading, and depending on which actual word the instruction reads (if it reads any word at all ... it's probably just setting up an address for MWAIT) from that cacheline, its ordering properties might be surprising. > [...] But both sides of clflush sounds sane, and as mentioned the > "go to sleep" side isn't as critical as the "wake up" side if the > monitor. Yeah. > Please let's just make that pre-monitor hack be a static key, and do > mfence explicitly around the clflush inside that conditional > section. Agreed. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/19/2013 10:19 AM, Ingo Molnar wrote: > > ( It would also be nice to know whether MONITOR loads that cacheline > into the CPUs cache, and in what state it loads it. ) > I would assume that is implementation-dependent. However, one plausible implementation is to load the cache line into the cache in shared state and monitor for evictions. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 19, 2013 at 11:22:01AM -0800, H. Peter Anvin wrote: > On 12/19/2013 10:19 AM, Ingo Molnar wrote: > > > > ( It would also be nice to know whether MONITOR loads that cacheline > > into the CPUs cache, and in what state it loads it. ) > > > > I would assume that is implementation-dependent. However, one plausible > implementation is to load the cache line into the cache in shared state > and monitor for evictions. I suppose the monitor part is the important part because certain C states drop all cache, presumably including the cacheline we're actually monitoring. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE according to the SDM. Ingo Molnar <mingo@kernel.org> wrote: > >* Peter Zijlstra <peterz@infradead.org> wrote: > >> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: >> > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: >> > > Likewise, having a barrier before the MONITOR looks sensible as >well. >> > >> > I again have to disagree, one would expect monitor to flush all >that is >> > required to start the monitor -- and it actually does so. As is >> > testified by this extra CLFLUSH being called a bug workaround. >> >> SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot >> pass a previous STORE to the same address. > >Yes ... but you could argue that CLFLUSH is neither a load nor a >store, it's a _cache sync_ operation, with its special ordering >properties. > >> That said; there's enough holes in there to swim a titanic through, >> seeing how MONITOR stares at an entire cacheline and LOAD/STORE >> order is specified on location, whatever that means. > >I think assuming that MONITOR is ordered as a load or better is a >pretty safe one (and in fact the Intel documentation seems to say so) >- I'd say MONITOR is in micro-code and essentially snoops on cache >events on that specific cache line, and loads the cache line on a >snoop hit? > >Btw., what state is the cache line after a MONITOR instruction, is it >loaded as shared, or as excusive? (exclusive would probably be better >for performance.) > >Thanks, > > Ingo
* H. Peter Anvin <hpa@zytor.com> wrote: > CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE > according to the SDM. Yes, I quoted the relevant text a couple of hours ago, CLFLUSH is pretty special, so it's not ordered against atomics or serializing instructions for example - only ordered against explict MFENCE calls, which on x86 is mb()/smp_mb(). Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 7b034a4057f9..6dce588f94b4 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -723,6 +723,23 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx) :: "a" (eax), "c" (ecx)); } +/* + * Issue a clflush in preparation for a monitor instruction if the CPU + * needs it. We force the address into the ax register to get a fixed + * length for the instruction, however, this is what the monitor instruction + * is going to need anyway, so it shouldn't add any additional code. + */ +static inline void clflush_monitor(const void *addr, unsigned long ecx, + unsigned long edx) +{ + alternative_input(ASM_NOP3, + "clflush (%0)", + X86_FEATURE_CLFLUSH_MONITOR, + "a" (addr)); + __monitor(addr, eax, edx); + smp_mb(); +} + extern void select_idle_routine(const struct cpuinfo_x86 *c); extern void init_amd_e400_c1e_mask(void); diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c index d2b7f27781bc..b14d02354134 100644 --- a/arch/x86/kernel/acpi/cstate.c +++ b/arch/x86/kernel/acpi/cstate.c @@ -163,11 +163,7 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe); void mwait_idle_with_hints(unsigned long ax, unsigned long cx) { if (!need_resched()) { - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) - clflush((void *)¤t_thread_info()->flags); - - __monitor((void *)¤t_thread_info()->flags, 0, 0); - smp_mb(); + clflush_monitor(¤t_thread_info()->flags, 0, 0); if (!need_resched()) __mwait(ax, cx); }