diff mbox series

[2/9] x86: limit the amount of TLB flushing in switch_cr3_cr4()

Message ID c842ae3e-962d-0272-eff6-c5f517fee7c9@suse.com (mailing list archive)
State Superseded
Headers show
Series XSA-292 follow-up | expand

Commit Message

Jan Beulich Sept. 11, 2019, 3:22 p.m. UTC
We really need to flush the TLB just once, if we do so with or after the
CR3 write. The only case where two flushes are unavoidable is when we
mean to turn off CR4.PGE (perhaps just temporarily; see the code
comment).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

Comments

Roger Pau Monné Sept. 12, 2019, 9:54 a.m. UTC | #1
On Wed, Sep 11, 2019 at 05:22:17PM +0200, Jan Beulich wrote:
> We really need to flush the TLB just once, if we do so with or after the
> CR3 write. The only case where two flushes are unavoidable is when we
> mean to turn off CR4.PGE (perhaps just temporarily; see the code
> comment).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Thanks, this seems to make the logic of the function easier, but I'm
slightly worried about the performance impact given that a full flush
of all PCID contexts is done instead of the previous selective flush.

> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -104,82 +104,65 @@ static void do_tlb_flush(void)
>  void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
>  {
>      unsigned long flags, old_cr4;
> -    unsigned int old_pcid;
>      u32 t;
>  
> +    /* Throughout this function we make this assumption: */
> +    ASSERT(!(cr4 & X86_CR4_PCIDE) || !(cr4 & X86_CR4_PGE));
> +
>      /* This non-reentrant function is sometimes called in interrupt context. */
>      local_irq_save(flags);
>  
>      t = pre_flush();
>  
>      old_cr4 = read_cr4();
> -    if ( old_cr4 & X86_CR4_PGE )
> +    ASSERT(!(old_cr4 & X86_CR4_PCIDE) || !(old_cr4 & X86_CR4_PGE));
> +
> +    /*
> +     * We need to write CR4 before CR3 if we're about to enable PCIDE, at the
> +     * very least when the new PCID is non-zero.
> +     *
> +     * As we also need to do two CR4 writes in total when PGE is enabled and
> +     * is to remain enabled, do the one temporarily turning off the bit right
> +     * here as well.
> +     *
> +     * The only TLB flushing effect we depend on here is in case we move from
> +     * PGE set to PCIDE set, where we want global page entries gone (and none
> +     * to re-appear) after this write.
> +     */
> +    if ( !(old_cr4 & X86_CR4_PCIDE) &&
> +         ((cr4 & X86_CR4_PCIDE) || (cr4 & old_cr4 & X86_CR4_PGE)) )
>      {
> -        /*
> -         * X86_CR4_PGE set means PCID is inactive.
> -         * We have to purge the TLB via flipping cr4.pge.
> -         */
>          old_cr4 = cr4 & ~X86_CR4_PGE;
>          write_cr4(old_cr4);
>      }
> -    else if ( use_invpcid )
> -    {
> -        /*
> -         * Flushing the TLB via INVPCID is necessary only in case PCIDs are
> -         * in use, which is true only with INVPCID being available.
> -         * Without PCID usage the following write_cr3() will purge the TLB
> -         * (we are in the cr4.pge off path) of all entries.
> -         * Using invpcid_flush_all_nonglobals() seems to be faster than
> -         * invpcid_flush_all(), so use that.
> -         */
> -        invpcid_flush_all_nonglobals();
> -
> -        /*
> -         * CR4.PCIDE needs to be set before the CR3 write below. Otherwise
> -         * - the CR3 write will fault when CR3.NOFLUSH is set (which is the
> -         *   case normally),
> -         * - the subsequent CR4 write will fault if CR3.PCID != 0.
> -         */
> -        if ( (old_cr4 & X86_CR4_PCIDE) < (cr4 & X86_CR4_PCIDE) )
> -        {
> -            write_cr4(cr4);
> -            old_cr4 = cr4;
> -        }
> -    }
>  
>      /*
> -     * If we don't change PCIDs, the CR3 write below needs to flush this very
> -     * PCID, even when a full flush was performed above, as we are currently
> -     * accumulating TLB entries again from the old address space.
> -     * NB: Clearing the bit when we don't use PCID is benign (as it is clear
> -     * already in that case), but allows the if() to be more simple.
> +     * If the CR4 write is to turn off PCIDE, we don't need the CR3 write to
> +     * flush anything, as that transition is a full flush itself.
>       */
> -    old_pcid = cr3_pcid(read_cr3());
> -    if ( old_pcid == cr3_pcid(cr3) )
> -        cr3 &= ~X86_CR3_NOFLUSH;
> -
> +    if ( (old_cr4 & X86_CR4_PCIDE) > (cr4 & X86_CR4_PCIDE) )
> +        cr3 |= X86_CR3_NOFLUSH;
>      write_cr3(cr3);
>  
>      if ( old_cr4 != cr4 )
>          write_cr4(cr4);
>  
>      /*
> -     * Make sure no TLB entries related to the old PCID created between
> -     * flushing the TLB and writing the new %cr3 value remain in the TLB.
> -     *
> -     * The write to CR4 just above has performed a wider flush in certain
> -     * cases, which therefore get excluded here. Since that write is
> -     * conditional, note in particular that it won't be skipped if PCIDE
> -     * transitions from 1 to 0. This is because the CR4 write further up will
> -     * have been skipped in this case, as PCIDE and PGE won't both be set at
> -     * the same time.
> -     *
> -     * Note also that PGE is always clear in old_cr4.
> +     *  PGE  | PCIDE | flush at
> +     * ------+-------+------------------------
> +     *  0->0 | 0->0  | CR3 write
> +     *  0->0 | 0->1  | n/a (see 1st CR4 write)
> +     *  0->x | 1->0  | CR4 write
> +     *  x->1 | x->1  | n/a
> +     *  0->0 | 1->1  | INVPCID
> +     *  0->1 | 0->0  | CR3 and CR4 writes
> +     *  1->0 | 0->0  | CR4 write
> +     *  1->0 | 0->1  | n/a (see 1st CR4 write)
> +     *  1->1 | 0->0  | n/a (see 1st CR4 write)
> +     *  1->x | 1->x  | n/a
>       */
> -    if ( old_pcid != cr3_pcid(cr3) &&

You seem to have dropped all the users of cr3_pcid, I guess the
function is not removed because you plan to use it in other sites?

> -         !(cr4 & X86_CR4_PGE) &&
> -         (old_cr4 & X86_CR4_PCIDE) <= (cr4 & X86_CR4_PCIDE) )
> -        invpcid_flush_single_context(old_pcid);
> +    if ( cr4 & X86_CR4_PCIDE )
> +        invpcid_flush_all_nonglobals();

Isn't this going to be quite expensive compared to the single PCID
flushing done before? (ie: invpcid_flush_single_context vs
invpcid_flush_all_nonglobals)

Thanks, Roger.
Jan Beulich Sept. 12, 2019, 10:11 a.m. UTC | #2
On 12.09.2019 11:54, Roger Pau Monné  wrote:
> On Wed, Sep 11, 2019 at 05:22:17PM +0200, Jan Beulich wrote:
>> We really need to flush the TLB just once, if we do so with or after the
>> CR3 write. The only case where two flushes are unavoidable is when we
>> mean to turn off CR4.PGE (perhaps just temporarily; see the code
>> comment).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Thanks, this seems to make the logic of the function easier, but I'm
> slightly worried about the performance impact given that a full flush
> of all PCID contexts is done instead of the previous selective flush.

I think you've misunderstood:

>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -104,82 +104,65 @@ static void do_tlb_flush(void)
>>  void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>  {
>>      unsigned long flags, old_cr4;
>> -    unsigned int old_pcid;
>>      u32 t;
>>  
>> +    /* Throughout this function we make this assumption: */
>> +    ASSERT(!(cr4 & X86_CR4_PCIDE) || !(cr4 & X86_CR4_PGE));
>> +
>>      /* This non-reentrant function is sometimes called in interrupt context. */
>>      local_irq_save(flags);
>>  
>>      t = pre_flush();
>>  
>>      old_cr4 = read_cr4();
>> -    if ( old_cr4 & X86_CR4_PGE )
>> +    ASSERT(!(old_cr4 & X86_CR4_PCIDE) || !(old_cr4 & X86_CR4_PGE));
>> +
>> +    /*
>> +     * We need to write CR4 before CR3 if we're about to enable PCIDE, at the
>> +     * very least when the new PCID is non-zero.
>> +     *
>> +     * As we also need to do two CR4 writes in total when PGE is enabled and
>> +     * is to remain enabled, do the one temporarily turning off the bit right
>> +     * here as well.
>> +     *
>> +     * The only TLB flushing effect we depend on here is in case we move from
>> +     * PGE set to PCIDE set, where we want global page entries gone (and none
>> +     * to re-appear) after this write.
>> +     */
>> +    if ( !(old_cr4 & X86_CR4_PCIDE) &&
>> +         ((cr4 & X86_CR4_PCIDE) || (cr4 & old_cr4 & X86_CR4_PGE)) )
>>      {
>> -        /*
>> -         * X86_CR4_PGE set means PCID is inactive.
>> -         * We have to purge the TLB via flipping cr4.pge.
>> -         */
>>          old_cr4 = cr4 & ~X86_CR4_PGE;
>>          write_cr4(old_cr4);
>>      }
>> -    else if ( use_invpcid )
>> -    {
>> -        /*
>> -         * Flushing the TLB via INVPCID is necessary only in case PCIDs are
>> -         * in use, which is true only with INVPCID being available.
>> -         * Without PCID usage the following write_cr3() will purge the TLB
>> -         * (we are in the cr4.pge off path) of all entries.
>> -         * Using invpcid_flush_all_nonglobals() seems to be faster than
>> -         * invpcid_flush_all(), so use that.
>> -         */
>> -        invpcid_flush_all_nonglobals();

This simply gets moved, while ...

>> -        /*
>> -         * CR4.PCIDE needs to be set before the CR3 write below. Otherwise
>> -         * - the CR3 write will fault when CR3.NOFLUSH is set (which is the
>> -         *   case normally),
>> -         * - the subsequent CR4 write will fault if CR3.PCID != 0.
>> -         */
>> -        if ( (old_cr4 & X86_CR4_PCIDE) < (cr4 & X86_CR4_PCIDE) )
>> -        {
>> -            write_cr4(cr4);
>> -            old_cr4 = cr4;
>> -        }
>> -    }
>>  
>>      /*
>> -     * If we don't change PCIDs, the CR3 write below needs to flush this very
>> -     * PCID, even when a full flush was performed above, as we are currently
>> -     * accumulating TLB entries again from the old address space.
>> -     * NB: Clearing the bit when we don't use PCID is benign (as it is clear
>> -     * already in that case), but allows the if() to be more simple.
>> +     * If the CR4 write is to turn off PCIDE, we don't need the CR3 write to
>> +     * flush anything, as that transition is a full flush itself.
>>       */
>> -    old_pcid = cr3_pcid(read_cr3());
>> -    if ( old_pcid == cr3_pcid(cr3) )
>> -        cr3 &= ~X86_CR3_NOFLUSH;
>> -
>> +    if ( (old_cr4 & X86_CR4_PCIDE) > (cr4 & X86_CR4_PCIDE) )
>> +        cr3 |= X86_CR3_NOFLUSH;
>>      write_cr3(cr3);
>>  
>>      if ( old_cr4 != cr4 )
>>          write_cr4(cr4);
>>  
>>      /*
>> -     * Make sure no TLB entries related to the old PCID created between
>> -     * flushing the TLB and writing the new %cr3 value remain in the TLB.
>> -     *
>> -     * The write to CR4 just above has performed a wider flush in certain
>> -     * cases, which therefore get excluded here. Since that write is
>> -     * conditional, note in particular that it won't be skipped if PCIDE
>> -     * transitions from 1 to 0. This is because the CR4 write further up will
>> -     * have been skipped in this case, as PCIDE and PGE won't both be set at
>> -     * the same time.
>> -     *
>> -     * Note also that PGE is always clear in old_cr4.
>> +     *  PGE  | PCIDE | flush at
>> +     * ------+-------+------------------------
>> +     *  0->0 | 0->0  | CR3 write
>> +     *  0->0 | 0->1  | n/a (see 1st CR4 write)
>> +     *  0->x | 1->0  | CR4 write
>> +     *  x->1 | x->1  | n/a
>> +     *  0->0 | 1->1  | INVPCID
>> +     *  0->1 | 0->0  | CR3 and CR4 writes
>> +     *  1->0 | 0->0  | CR4 write
>> +     *  1->0 | 0->1  | n/a (see 1st CR4 write)
>> +     *  1->1 | 0->0  | n/a (see 1st CR4 write)
>> +     *  1->x | 1->x  | n/a
>>       */
>> -    if ( old_pcid != cr3_pcid(cr3) &&
> 
> You seem to have dropped all the users of cr3_pcid, I guess the
> function is not removed because you plan to use it in other sites?
> 
>> -         !(cr4 & X86_CR4_PGE) &&
>> -         (old_cr4 & X86_CR4_PCIDE) <= (cr4 & X86_CR4_PCIDE) )
>> -        invpcid_flush_single_context(old_pcid);
>> +    if ( cr4 & X86_CR4_PCIDE )
>> +        invpcid_flush_all_nonglobals();
> 
> Isn't this going to be quite expensive compared to the single PCID
> flushing done before? (ie: invpcid_flush_single_context vs
> invpcid_flush_all_nonglobals)

... the invpcid_flush_single_context() gets eliminated altogether
(by doing the main flush _after_ the control register writes).

As to cr3_pcid() - the function is valid to have in case of future
use (e.g. in HVM code), so I didn't see a point in deleting it.

Jan
Roger Pau Monné Sept. 12, 2019, 10:38 a.m. UTC | #3
On Thu, Sep 12, 2019 at 12:11:55PM +0200, Jan Beulich wrote:
> On 12.09.2019 11:54, Roger Pau Monné  wrote:
> > On Wed, Sep 11, 2019 at 05:22:17PM +0200, Jan Beulich wrote:
> >> We really need to flush the TLB just once, if we do so with or after the
> >> CR3 write. The only case where two flushes are unavoidable is when we
> >> mean to turn off CR4.PGE (perhaps just temporarily; see the code
> >> comment).
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Thanks, this seems to make the logic of the function easier, but I'm
> > slightly worried about the performance impact given that a full flush
> > of all PCID contexts is done instead of the previous selective flush.
> 
> I think you've misunderstood:
> 
> >> --- a/xen/arch/x86/flushtlb.c
> >> +++ b/xen/arch/x86/flushtlb.c
> >> @@ -104,82 +104,65 @@ static void do_tlb_flush(void)
> >>  void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
> >>  {
> >>      unsigned long flags, old_cr4;
> >> -    unsigned int old_pcid;
> >>      u32 t;
> >>  
> >> +    /* Throughout this function we make this assumption: */
> >> +    ASSERT(!(cr4 & X86_CR4_PCIDE) || !(cr4 & X86_CR4_PGE));
> >> +
> >>      /* This non-reentrant function is sometimes called in interrupt context. */
> >>      local_irq_save(flags);
> >>  
> >>      t = pre_flush();
> >>  
> >>      old_cr4 = read_cr4();
> >> -    if ( old_cr4 & X86_CR4_PGE )
> >> +    ASSERT(!(old_cr4 & X86_CR4_PCIDE) || !(old_cr4 & X86_CR4_PGE));
> >> +
> >> +    /*
> >> +     * We need to write CR4 before CR3 if we're about to enable PCIDE, at the
> >> +     * very least when the new PCID is non-zero.
> >> +     *
> >> +     * As we also need to do two CR4 writes in total when PGE is enabled and
> >> +     * is to remain enabled, do the one temporarily turning off the bit right
> >> +     * here as well.
> >> +     *
> >> +     * The only TLB flushing effect we depend on here is in case we move from
> >> +     * PGE set to PCIDE set, where we want global page entries gone (and none
> >> +     * to re-appear) after this write.
> >> +     */
> >> +    if ( !(old_cr4 & X86_CR4_PCIDE) &&
> >> +         ((cr4 & X86_CR4_PCIDE) || (cr4 & old_cr4 & X86_CR4_PGE)) )
> >>      {
> >> -        /*
> >> -         * X86_CR4_PGE set means PCID is inactive.
> >> -         * We have to purge the TLB via flipping cr4.pge.
> >> -         */
> >>          old_cr4 = cr4 & ~X86_CR4_PGE;
> >>          write_cr4(old_cr4);
> >>      }
> >> -    else if ( use_invpcid )
> >> -    {
> >> -        /*
> >> -         * Flushing the TLB via INVPCID is necessary only in case PCIDs are
> >> -         * in use, which is true only with INVPCID being available.
> >> -         * Without PCID usage the following write_cr3() will purge the TLB
> >> -         * (we are in the cr4.pge off path) of all entries.
> >> -         * Using invpcid_flush_all_nonglobals() seems to be faster than
> >> -         * invpcid_flush_all(), so use that.
> >> -         */
> >> -        invpcid_flush_all_nonglobals();
> 
> This simply gets moved, while ...
> 
> >> -        /*
> >> -         * CR4.PCIDE needs to be set before the CR3 write below. Otherwise
> >> -         * - the CR3 write will fault when CR3.NOFLUSH is set (which is the
> >> -         *   case normally),
> >> -         * - the subsequent CR4 write will fault if CR3.PCID != 0.
> >> -         */
> >> -        if ( (old_cr4 & X86_CR4_PCIDE) < (cr4 & X86_CR4_PCIDE) )
> >> -        {
> >> -            write_cr4(cr4);
> >> -            old_cr4 = cr4;
> >> -        }
> >> -    }
> >>  
> >>      /*
> >> -     * If we don't change PCIDs, the CR3 write below needs to flush this very
> >> -     * PCID, even when a full flush was performed above, as we are currently
> >> -     * accumulating TLB entries again from the old address space.
> >> -     * NB: Clearing the bit when we don't use PCID is benign (as it is clear
> >> -     * already in that case), but allows the if() to be more simple.
> >> +     * If the CR4 write is to turn off PCIDE, we don't need the CR3 write to
> >> +     * flush anything, as that transition is a full flush itself.
> >>       */
> >> -    old_pcid = cr3_pcid(read_cr3());
> >> -    if ( old_pcid == cr3_pcid(cr3) )
> >> -        cr3 &= ~X86_CR3_NOFLUSH;
> >> -
> >> +    if ( (old_cr4 & X86_CR4_PCIDE) > (cr4 & X86_CR4_PCIDE) )
> >> +        cr3 |= X86_CR3_NOFLUSH;
> >>      write_cr3(cr3);
> >>  
> >>      if ( old_cr4 != cr4 )
> >>          write_cr4(cr4);
> >>  
> >>      /*
> >> -     * Make sure no TLB entries related to the old PCID created between
> >> -     * flushing the TLB and writing the new %cr3 value remain in the TLB.
> >> -     *
> >> -     * The write to CR4 just above has performed a wider flush in certain
> >> -     * cases, which therefore get excluded here. Since that write is
> >> -     * conditional, note in particular that it won't be skipped if PCIDE
> >> -     * transitions from 1 to 0. This is because the CR4 write further up will
> >> -     * have been skipped in this case, as PCIDE and PGE won't both be set at
> >> -     * the same time.
> >> -     *
> >> -     * Note also that PGE is always clear in old_cr4.
> >> +     *  PGE  | PCIDE | flush at
> >> +     * ------+-------+------------------------
> >> +     *  0->0 | 0->0  | CR3 write
> >> +     *  0->0 | 0->1  | n/a (see 1st CR4 write)
> >> +     *  0->x | 1->0  | CR4 write
> >> +     *  x->1 | x->1  | n/a
> >> +     *  0->0 | 1->1  | INVPCID
> >> +     *  0->1 | 0->0  | CR3 and CR4 writes
> >> +     *  1->0 | 0->0  | CR4 write
> >> +     *  1->0 | 0->1  | n/a (see 1st CR4 write)
> >> +     *  1->1 | 0->0  | n/a (see 1st CR4 write)
> >> +     *  1->x | 1->x  | n/a
> >>       */
> >> -    if ( old_pcid != cr3_pcid(cr3) &&
> > 
> > You seem to have dropped all the users of cr3_pcid, I guess the
> > function is not removed because you plan to use it in other sites?
> > 
> >> -         !(cr4 & X86_CR4_PGE) &&
> >> -         (old_cr4 & X86_CR4_PCIDE) <= (cr4 & X86_CR4_PCIDE) )
> >> -        invpcid_flush_single_context(old_pcid);
> >> +    if ( cr4 & X86_CR4_PCIDE )
> >> +        invpcid_flush_all_nonglobals();
> > 
> > Isn't this going to be quite expensive compared to the single PCID
> > flushing done before? (ie: invpcid_flush_single_context vs
> > invpcid_flush_all_nonglobals)
> 
> ... the invpcid_flush_single_context() gets eliminated altogether
> (by doing the main flush _after_ the control register writes).

Oh, thanks, I've certainly missed this move, sorry.

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
diff mbox series

Patch

--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -104,82 +104,65 @@  static void do_tlb_flush(void)
 void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
 {
     unsigned long flags, old_cr4;
-    unsigned int old_pcid;
     u32 t;
 
+    /* Throughout this function we make this assumption: */
+    ASSERT(!(cr4 & X86_CR4_PCIDE) || !(cr4 & X86_CR4_PGE));
+
     /* This non-reentrant function is sometimes called in interrupt context. */
     local_irq_save(flags);
 
     t = pre_flush();
 
     old_cr4 = read_cr4();
-    if ( old_cr4 & X86_CR4_PGE )
+    ASSERT(!(old_cr4 & X86_CR4_PCIDE) || !(old_cr4 & X86_CR4_PGE));
+
+    /*
+     * We need to write CR4 before CR3 if we're about to enable PCIDE, at the
+     * very least when the new PCID is non-zero.
+     *
+     * As we also need to do two CR4 writes in total when PGE is enabled and
+     * is to remain enabled, do the one temporarily turning off the bit right
+     * here as well.
+     *
+     * The only TLB flushing effect we depend on here is in case we move from
+     * PGE set to PCIDE set, where we want global page entries gone (and none
+     * to re-appear) after this write.
+     */
+    if ( !(old_cr4 & X86_CR4_PCIDE) &&
+         ((cr4 & X86_CR4_PCIDE) || (cr4 & old_cr4 & X86_CR4_PGE)) )
     {
-        /*
-         * X86_CR4_PGE set means PCID is inactive.
-         * We have to purge the TLB via flipping cr4.pge.
-         */
         old_cr4 = cr4 & ~X86_CR4_PGE;
         write_cr4(old_cr4);
     }
-    else if ( use_invpcid )
-    {
-        /*
-         * Flushing the TLB via INVPCID is necessary only in case PCIDs are
-         * in use, which is true only with INVPCID being available.
-         * Without PCID usage the following write_cr3() will purge the TLB
-         * (we are in the cr4.pge off path) of all entries.
-         * Using invpcid_flush_all_nonglobals() seems to be faster than
-         * invpcid_flush_all(), so use that.
-         */
-        invpcid_flush_all_nonglobals();
-
-        /*
-         * CR4.PCIDE needs to be set before the CR3 write below. Otherwise
-         * - the CR3 write will fault when CR3.NOFLUSH is set (which is the
-         *   case normally),
-         * - the subsequent CR4 write will fault if CR3.PCID != 0.
-         */
-        if ( (old_cr4 & X86_CR4_PCIDE) < (cr4 & X86_CR4_PCIDE) )
-        {
-            write_cr4(cr4);
-            old_cr4 = cr4;
-        }
-    }
 
     /*
-     * If we don't change PCIDs, the CR3 write below needs to flush this very
-     * PCID, even when a full flush was performed above, as we are currently
-     * accumulating TLB entries again from the old address space.
-     * NB: Clearing the bit when we don't use PCID is benign (as it is clear
-     * already in that case), but allows the if() to be more simple.
+     * If the CR4 write is to turn off PCIDE, we don't need the CR3 write to
+     * flush anything, as that transition is a full flush itself.
      */
-    old_pcid = cr3_pcid(read_cr3());
-    if ( old_pcid == cr3_pcid(cr3) )
-        cr3 &= ~X86_CR3_NOFLUSH;
-
+    if ( (old_cr4 & X86_CR4_PCIDE) > (cr4 & X86_CR4_PCIDE) )
+        cr3 |= X86_CR3_NOFLUSH;
     write_cr3(cr3);
 
     if ( old_cr4 != cr4 )
         write_cr4(cr4);
 
     /*
-     * Make sure no TLB entries related to the old PCID created between
-     * flushing the TLB and writing the new %cr3 value remain in the TLB.
-     *
-     * The write to CR4 just above has performed a wider flush in certain
-     * cases, which therefore get excluded here. Since that write is
-     * conditional, note in particular that it won't be skipped if PCIDE
-     * transitions from 1 to 0. This is because the CR4 write further up will
-     * have been skipped in this case, as PCIDE and PGE won't both be set at
-     * the same time.
-     *
-     * Note also that PGE is always clear in old_cr4.
+     *  PGE  | PCIDE | flush at
+     * ------+-------+------------------------
+     *  0->0 | 0->0  | CR3 write
+     *  0->0 | 0->1  | n/a (see 1st CR4 write)
+     *  0->x | 1->0  | CR4 write
+     *  x->1 | x->1  | n/a
+     *  0->0 | 1->1  | INVPCID
+     *  0->1 | 0->0  | CR3 and CR4 writes
+     *  1->0 | 0->0  | CR4 write
+     *  1->0 | 0->1  | n/a (see 1st CR4 write)
+     *  1->1 | 0->0  | n/a (see 1st CR4 write)
+     *  1->x | 1->x  | n/a
      */
-    if ( old_pcid != cr3_pcid(cr3) &&
-         !(cr4 & X86_CR4_PGE) &&
-         (old_cr4 & X86_CR4_PCIDE) <= (cr4 & X86_CR4_PCIDE) )
-        invpcid_flush_single_context(old_pcid);
+    if ( cr4 & X86_CR4_PCIDE )
+        invpcid_flush_all_nonglobals();
 
     post_flush(t);