x86 idle: repair large-server 50-watt idle-power regression

Message ID	52B31B21.6010901@zytor.com (mailing list archive)
State	Not Applicable, archived
Headers	show Return-Path: <linux-pm-owner@kernel.org> Message-ID: <52B31B21.6010901@zytor.com> Date: Thu, 19 Dec 2013 08:13:21 -0800 From: "H. Peter Anvin" <hpa@zytor.com> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Ingo Molnar <mingo@kernel.org> CC: Len Brown <lenb@kernel.org>, x86@kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, Len Brown <len.brown@intel.com>, stable@vger.kernel.org, Linus Torvalds <torvalds@linux-foundation.org>, Thomas Gleixner <tglx@linutronix.de>, Peter Zijlstra <a.p.zijlstra@chello.nl>, Mike Galbraith <efault@gmx.de>, Borislav Petkov <bp@alien8.de> Subject: Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression References: <baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com> <20131219122257.GC11279@gmail.com> <52B316FF.50906@zytor.com> <20131219160210.GA28426@gmail.com> In-Reply-To: <20131219160210.GA28426@gmail.com> Content-Type: multipart/mixed; boundary="------------000906000501000406070602" Sender: linux-pm-owner@vger.kernel.org Precedence: bulk

H. Peter Anvin Dec. 19, 2013, 4:13 p.m. UTC

How does this look?  Completely untested, of course.

I do wonder if we need more memory barriers, though.

An alternative would be to move everything into mwait_idle_with_hints().

	-hpa

Peter Zijlstra Dec. 19, 2013, 4:21 p.m. UTC | #1

On Thu, Dec 19, 2013 at 08:13:21AM -0800, H. Peter Anvin wrote:
> How does this look?  Completely untested, of course.
> 
> I do wonder if we need more memory barriers, though.
> 
> An alternative would be to move everything into mwait_idle_with_hints().
> 
> 	-hpa
> 

> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 7b034a4057f9..6dce588f94b4 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -723,6 +723,23 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
>  		     :: "a" (eax), "c" (ecx));
>  }
>  
> +/*
> + * Issue a clflush in preparation for a monitor instruction if the CPU
> + * needs it.  We force the address into the ax register to get a fixed
> + * length for the instruction, however, this is what the monitor instruction
> + * is going to need anyway, so it shouldn't add any additional code.
> + */
> +static inline void clflush_monitor(const void *addr, unsigned long ecx,
> +				   unsigned long edx)
> +{
> +	alternative_input(ASM_NOP3,
> +			  "clflush (%0)",
> +			  X86_FEATURE_CLFLUSH_MONITOR,
> +			  "a" (addr));
> +	__monitor(addr, eax, edx);
> +	smp_mb();
> +}

What's that mb for?

Also, can you please merge:

  http://marc.info/?l=linux-kernel&m=138685838420632

Thomas said he would pick up that series, but seems to have gone missing
the past week or so.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

H. Peter Anvin Dec. 19, 2013, 4:50 p.m. UTC | #2

On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> 
> What's that mb for?
> 

It already exists in mwait_idle_with_hints(); I just moved it into this
common function.  It is a bit odd, I have to admit; it seems like it
should be *before* the monitor (and possibly we should have one after
the CLFLUSH as well?)

> Also, can you please merge:
> 
>   http://marc.info/?l=linux-kernel&m=138685838420632
> 
> Thomas said he would pick up that series, but seems to have gone missing
> the past week or so.
> 

We need to get the immediate regression fixed.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar Dec. 19, 2013, 5:07 p.m. UTC | #3

* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> > 
> > What's that mb for?
> > 
> 
> It already exists in mwait_idle_with_hints(); I just moved it into 
> this common function.  It is a bit odd, I have to admit; it seems 
> like it should be *before* the monitor (and possibly we should have 
> one after the CLFLUSH as well?)

Yes, I think we need a barrier before the CLFLUSH, because according 
to my reading of the Intel documentation CLFLUSH has no implicit 
ordering so it might get reordered with the store to ->flags in 
current_set_polling_and_test(), which might result in spurious wakeup 
problems again.

(And CLFLUSH is a store in a sense, so special in that the regular 
ordering for stores does not apply.)

Likewise, having a barrier before the MONITOR looks sensible as well. 
Having it _after_ monitor looks weird and is probably wrong. [It might 
have been the effects of someone seeing the spurious wakeup problems 
with realizing the true source, or so.]

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter Zijlstra Dec. 19, 2013, 5:25 p.m. UTC | #4

On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> 
> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
> > On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> > > 
> > > What's that mb for?
> > > 
> > 
> > It already exists in mwait_idle_with_hints(); I just moved it into 
> > this common function.  It is a bit odd, I have to admit; it seems 
> > like it should be *before* the monitor (and possibly we should have 
> > one after the CLFLUSH as well?)
> 
> Yes, I think we need a barrier before the CLFLUSH, because according 
> to my reading of the Intel documentation CLFLUSH has no implicit 
> ordering so it might get reordered with the store to ->flags in 
> current_set_polling_and_test(), which might result in spurious wakeup 
> problems again.

No it cannot; since current_set_polling_and_test() already has a barrier
to prevent that.

Also, the location patched by hpa doesn't actually call that at all.

That said, I would find it very strange indeed if a CLFLUSH doesn't also
flush the store buffer.

> (And CLFLUSH is a store in a sense, so special in that the regular 
> ordering for stores does not apply.)
> 
> Likewise, having a barrier before the MONITOR looks sensible as well. 
> Having it _after_ monitor looks weird and is probably wrong. [It might 
> have been the effects of someone seeing the spurious wakeup problems 
> with realizing the true source, or so.]

I again have to disagree, one would expect monitor to flush all that is
required to start the monitor -- and it actually does so. As is
testified by this extra CLFLUSH being called a bug workaround.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter Zijlstra Dec. 19, 2013, 5:36 p.m. UTC | #5

On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> That said, I would find it very strange indeed if a CLFLUSH doesn't also
> flush the store buffer.

OK, it explicitly states it does not do that and you indeed need an
mfence before the clflush.


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter Zijlstra Dec. 19, 2013, 5:50 p.m. UTC | #6

On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> > Likewise, having a barrier before the MONITOR looks sensible as well. 
> 
> I again have to disagree, one would expect monitor to flush all that is
> required to start the monitor -- and it actually does so. As is
> testified by this extra CLFLUSH being called a bug workaround.

SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot pass
a previous STORE to the same address.

That said; there's enough holes in there to swim a titanic through,
seeing how MONITOR stares at an entire cacheline and LOAD/STORE order is
specified on location, whatever that means.

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

H. Peter Anvin Dec. 19, 2013, 6:05 p.m. UTC | #7

On 12/19/2013 09:36 AM, Peter Zijlstra wrote:
> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
>> That said, I would find it very strange indeed if a CLFLUSH doesn't also
>> flush the store buffer.
> 
> OK, it explicitly states it does not do that and you indeed need an
> mfence before the clflush.
> 

So, MONITOR is defined to be ordered as a load, which I think should be
adequate, but I really wonder if we should have mfence on both sides of
clflush.  This now is up to 9 bytes, and perhaps pushing it a bit with
how much we would be willing to patch out.

On the other hand - the CLFLUSH seems to have worked well enough by
itself, and this is all probabilistic anyway, so perhaps we should just
leave the naked CLFLUSH in and not worry about it unless measurements
say otherwise?

	-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

H. Peter Anvin Dec. 19, 2013, 6:09 p.m. UTC | #8

On 12/19/2013 09:07 AM, Ingo Molnar wrote:
> 
> Likewise, having a barrier before the MONITOR looks sensible as well. 
> Having it _after_ monitor looks weird and is probably wrong. [It might 
> have been the effects of someone seeing the spurious wakeup problems 
> with realizing the true source, or so.]
> 

Does anyone know the history of this barrier after the monitor?  I know
Len is looking for a minimal patchset that can go into -stable, and it
seems prudent to not preturb the code more than necessary, but going
forward it would be nice to know...

	-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar Dec. 19, 2013, 6:10 p.m. UTC | #9

* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> > 
> > * H. Peter Anvin <hpa@zytor.com> wrote:
> > 
> > > On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> > > > 
> > > > What's that mb for?
> > > > 
> > > 
> > > It already exists in mwait_idle_with_hints(); I just moved it into 
> > > this common function.  It is a bit odd, I have to admit; it seems 
> > > like it should be *before* the monitor (and possibly we should have 
> > > one after the CLFLUSH as well?)
> > 
> > Yes, I think we need a barrier before the CLFLUSH, because according 
> > to my reading of the Intel documentation CLFLUSH has no implicit 
> > ordering so it might get reordered with the store to ->flags in 
> > current_set_polling_and_test(), which might result in spurious wakeup 
> > problems again.
> 
> No it cannot; since current_set_polling_and_test() already has a 
> barrier to prevent that.

See below:

> Also, the location patched by hpa doesn't actually call that at all.
> 
> That said, I would find it very strange indeed if a CLFLUSH doesn't 
> also flush the store buffer.

So, the Intel documentation says (sorry about the lazy-link):

  http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc31.htm

 "CLFLUSH is only ordered by the MFENCE instruction. It is not 
  guaranteed to be ordered by any other fencing, serializing or other 
  CLFLUSH instruction. For example, software can use an MFENCE 
  instruction to insure that previous stores are included in the 
  write-back."

So a specific MFENCE barrier is needed.

Also note that this wording excludes implicit serialization such as 
LOCK prefix or XCHG barriers. As it happens 
current_set_polling_and_test() uses smp_mb(), which happens to map to 
MFENCE on all CPUs that can do CLFLUSH, but that's really just an 
accident and in no way engineered.

_At minimum_ we need a prominent comment at the clflush usage site 
that we rely on the MFENCE in current_set_polling_and_test() ...

> > (And CLFLUSH is a store in a sense, so special in that the regular 
> > ordering for stores does not apply.)
> > 
> > Likewise, having a barrier before the MONITOR looks sensible as 
> > well. Having it _after_ monitor looks weird and is probably wrong. 
> > [It might have been the effects of someone seeing the spurious 
> > wakeup problems with realizing the true source, or so.]
> 
> I again have to disagree, one would expect monitor to flush all that 
> is required to start the monitor -- and it actually does so. As is 
> testified by this extra CLFLUSH being called a bug workaround.

This assumption would be safer - although AFAICS the Intel 
MONITOR/MWAIT documentation is quiet about this aspect.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar Dec. 19, 2013, 6:14 p.m. UTC | #10

* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 09:36 AM, Peter Zijlstra wrote:
> > On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> >> That said, I would find it very strange indeed if a CLFLUSH doesn't also
> >> flush the store buffer.
> > 
> > OK, it explicitly states it does not do that and you indeed need 
> > an mfence before the clflush.
> 
> So, MONITOR is defined to be ordered as a load, which I think should 
> be adequate, but I really wonder if we should have mfence on both 
> sides of clflush.  This now is up to 9 bytes, and perhaps pushing it 
> a bit with how much we would be willing to patch out.
> 
> On the other hand - the CLFLUSH seems to have worked well enough by 
> itself, and this is all probabilistic anyway, so perhaps we should 
> just leave the naked CLFLUSH in and not worry about it unless 
> measurements say otherwise?

So I think the window of breakage was rather large here, and since it 
seems to trigger on rare types of hardware I think we'd be better off 
by erring on the side of robustness this time around ...

This is the 'go to idle' path, which isn't as time-critical as the 
'get out of idle' code path.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar Dec. 19, 2013, 6:18 p.m. UTC | #11

* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> > > Likewise, having a barrier before the MONITOR looks sensible as well. 
> > 
> > I again have to disagree, one would expect monitor to flush all that is
> > required to start the monitor -- and it actually does so. As is
> > testified by this extra CLFLUSH being called a bug workaround.
> 
> SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot 
> pass a previous STORE to the same address.

Yes ... but you could argue that CLFLUSH is neither a load nor a 
store, it's a _cache sync_ operation, with its special ordering 
properties.

> That said; there's enough holes in there to swim a titanic through, 
> seeing how MONITOR stares at an entire cacheline and LOAD/STORE 
> order is specified on location, whatever that means.

I think assuming that MONITOR is ordered as a load or better is a 
pretty safe one (and in fact the Intel documentation seems to say so) 
- I'd say MONITOR is in micro-code and essentially snoops on cache 
events on that specific cache line, and loads the cache line on a 
snoop hit?

Btw., what state is the cache line after a MONITOR instruction, is it 
loaded as shared, or as excusive? (exclusive would probably be better 
for performance.)

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

H. Peter Anvin Dec. 19, 2013, 6:19 p.m. UTC | #12

On 12/19/2013 10:09 AM, H. Peter Anvin wrote:
> On 12/19/2013 09:07 AM, Ingo Molnar wrote:
>>
>> Likewise, having a barrier before the MONITOR looks sensible as well. 
>> Having it _after_ monitor looks weird and is probably wrong. [It might 
>> have been the effects of someone seeing the spurious wakeup problems 
>> with realizing the true source, or so.]
>>
> 
> Does anyone know the history of this barrier after the monitor?  I know
> Len is looking for a minimal patchset that can go into -stable, and it
> seems prudent to not preturb the code more than necessary, but going
> forward it would be nice to know...
> 

Hmm... it *looks* like it is intended to be part of the construct:

	smp_mb();
	if (!need_resched())
		...

I found a note in the HLT variant of the function saying:

/*
 * TS_POLLING-cleared state must be visible before we
 * test NEED_RESCHED:
 */

... which presumably has been copied elsewhere.

	-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar Dec. 19, 2013, 6:19 p.m. UTC | #13

* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 09:07 AM, Ingo Molnar wrote:
> > 
> > Likewise, having a barrier before the MONITOR looks sensible as 
> > well. Having it _after_ monitor looks weird and is probably wrong. 
> > [It might have been the effects of someone seeing the spurious 
> > wakeup problems with realizing the true source, or so.]
> 
> Does anyone know the history of this barrier after the monitor?  I 
> know Len is looking for a minimal patchset that can go into -stable, 
> and it seems prudent to not preturb the code more than necessary, 
> but going forward it would be nice to know...

For the minimal fix I don't think we should change it - but for v3.14 
it looks like a speedup for the from-idle codepath, which is 
performance sensitive.

( It would also be nice to know whether MONITOR loads that cacheline 
  into the CPUs cache, and in what state it loads it. )

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar Dec. 19, 2013, 6:23 p.m. UTC | #14

* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 10:09 AM, H. Peter Anvin wrote:
> > On 12/19/2013 09:07 AM, Ingo Molnar wrote:
> >>
> >> Likewise, having a barrier before the MONITOR looks sensible as well. 
> >> Having it _after_ monitor looks weird and is probably wrong. [It might 
> >> have been the effects of someone seeing the spurious wakeup problems 
> >> with realizing the true source, or so.]
> >>
> > 
> > Does anyone know the history of this barrier after the monitor?  I know
> > Len is looking for a minimal patchset that can go into -stable, and it
> > seems prudent to not preturb the code more than necessary, but going
> > forward it would be nice to know...
> > 
> 
> Hmm... it *looks* like it is intended to be part of the construct:
> 
> 	smp_mb();
> 	if (!need_resched())
> 		...
> 
> I found a note in the HLT variant of the function saying:
> 
> /*
>  * TS_POLLING-cleared state must be visible before we
>  * test NEED_RESCHED:
>  */

Yes, that makes sense: the need_resched test is a load, and MONITOR is 
a load as well. Can the two ever cross, or does the CPU guarantee that 
because it's the same address, the loads don't cross?

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar Dec. 19, 2013, 6:43 p.m. UTC | #15

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The x86 memory rules are that two loads always execute in order (ie 
> rmb is a no-op).
> 
> So I see no reason for a memory barrier after the monitor. [...]

Yes, I'm leaning towards that interpretation as well, but the reason 
I'm a bit catious is the somewhat curious (to me!) wording of the 
MONITOR instruction:

  Sets up a linear address range to be monitored by hardware and 
  activates the monitor.

  ...

  The MONITOR instruction is ordered as a load operation with respect 
  to other memory transactions. The instruction can be used at all 
  privilege levels and is subject to all permission checking and 
  faults associated with a byte load. Like a load, the MONITOR 
  instruction sets the A-bit but not the D-bit in page tables.

Where apparently the 'range' means 'full cache line surrounding the 
memory address in question'.

We have no other load instructions that operate on such a large 
'range' of addresses, and I wanted to make sure it's a true (single 
byte) load for that specific address. The documentation does not 
appear to explicitly state that it's a load for that address - only 
that it's ordered as a load.

The reason I'm asking is because 'flags' itself might not be at the 
beginning of the cache line, as it's in the middle of thread_info:

 struct thread_info {
        struct task_struct      *task;          /* main task structure */
        struct exec_domain      *exec_domain;   /* execution domain */
        __u32                   flags;          /* low level flags */

while 'MONITOR' appears to work on the cache line. So are all 
addresses within that cache line ordered? Only the specific address 
given to the instruction itself? Only the first word of the cacheline 
itself?

The documentation is a bit vague, at least in my reading, and 
depending on which actual word the instruction reads (if it reads any 
word at all ... it's probably just setting up an address for MWAIT) 
from that cacheline, its ordering properties might be surprising.

> [...] But both sides of clflush sounds sane, and as mentioned the 
> "go to sleep" side isn't as critical as the "wake up" side if the 
> monitor.

Yeah.

> Please let's just make that pre-monitor hack be a static key, and do 
> mfence explicitly around the clflush inside that conditional 
> section.

Agreed.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

H. Peter Anvin Dec. 19, 2013, 7:22 p.m. UTC | #16

On 12/19/2013 10:19 AM, Ingo Molnar wrote:
> 
> ( It would also be nice to know whether MONITOR loads that cacheline 
>   into the CPUs cache, and in what state it loads it. )
> 

I would assume that is implementation-dependent.  However, one plausible
implementation is to load the cache line into the cache in shared state
and monitor for evictions.

	-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Peter Zijlstra Dec. 19, 2013, 7:27 p.m. UTC | #17

On Thu, Dec 19, 2013 at 11:22:01AM -0800, H. Peter Anvin wrote:
> On 12/19/2013 10:19 AM, Ingo Molnar wrote:
> > 
> > ( It would also be nice to know whether MONITOR loads that cacheline 
> >   into the CPUs cache, and in what state it loads it. )
> > 
> 
> I would assume that is implementation-dependent.  However, one plausible
> implementation is to load the cache line into the cache in shared state
> and monitor for evictions.

I suppose the monitor part is the important part because certain C
states drop all cache, presumably including the cacheline we're actually
monitoring.


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

H. Peter Anvin Dec. 19, 2013, 9:05 p.m. UTC | #18

CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE according to the SDM.

Ingo Molnar <mingo@kernel.org> wrote:
>
>* Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
>> > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
>> > > Likewise, having a barrier before the MONITOR looks sensible as
>well. 
>> > 
>> > I again have to disagree, one would expect monitor to flush all
>that is
>> > required to start the monitor -- and it actually does so. As is
>> > testified by this extra CLFLUSH being called a bug workaround.
>> 
>> SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot 
>> pass a previous STORE to the same address.
>
>Yes ... but you could argue that CLFLUSH is neither a load nor a 
>store, it's a _cache sync_ operation, with its special ordering 
>properties.
>
>> That said; there's enough holes in there to swim a titanic through, 
>> seeing how MONITOR stares at an entire cacheline and LOAD/STORE 
>> order is specified on location, whatever that means.
>
>I think assuming that MONITOR is ordered as a load or better is a 
>pretty safe one (and in fact the Intel documentation seems to say so) 
>- I'd say MONITOR is in micro-code and essentially snoops on cache 
>events on that specific cache line, and loads the cache line on a 
>snoop hit?
>
>Btw., what state is the cache line after a MONITOR instruction, is it 
>loaded as shared, or as excusive? (exclusive would probably be better 
>for performance.)
>
>Thanks,
>
>	Ingo

Ingo Molnar Dec. 19, 2013, 9:17 p.m. UTC | #19

* H. Peter Anvin <hpa@zytor.com> wrote:

> CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE 
> according to the SDM.

Yes, I quoted the relevant text a couple of hours ago, CLFLUSH is 
pretty special, so it's not ordered against atomics or serializing 
instructions for example - only ordered against explict MFENCE calls, 
which on x86 is mb()/smp_mb().

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

x86 idle: repair large-server 50-watt idle-power regression

Commit Message

Comments

Patch