[for-4.14,v2] x86/tlb: fix assisted flush usage

Message ID	20200623145006.66723-1-roger.pau@citrix.com (mailing list archive)
State	Superseded
Headers	From: Roger Pau Monne <roger.pau@citrix.com>
	To: <xen-devel@lists.xenproject.org>
	Subject: [PATCH for-4.14 v2] x86/tlb: fix assisted flush usage
	Date: Tue, 23 Jun 2020 16:50:06 +0200
	Cc: Stefano Stabellini <sstabellini@kernel.org>, Julien Grall <julien@xen.org>, Wei Liu <wl@xen.org>, paul@xen.org, Andrew Cooper <andrew.cooper3@citrix.com>, Ian Jackson <ian.jackson@eu.citrix.com>, George Dunlap <george.dunlap@citrix.com>, Jan Beulich <jbeulich@suse.com>, Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>, Roger Pau Monne <roger.pau@citrix.com>
Series	[for-4.14,v2] x86/tlb: fix assisted flush usage

Roger Pau Monné June 23, 2020, 2:50 p.m. UTC

Commit e9aca9470ed86 introduced a regression by avoiding sending
IPIs for certain flush operations. Xen's page fault handler
(spurious_page_fault) relies on blocking interrupts in order to
prevent handling TLB flush IPIs, and thus to prevent other CPUs from
removing page table pages. Switching to assisted flushing avoids such
IPIs, and thus can result in pages belonging to the page tables being
removed (and possibly re-used) while __page_fault_type is being
executed.

Force some of the TLB flushes to use IPIs, thus avoiding the assisted
TLB flush. The selected flushes are the page type change (when
switching from a page table type to a different one, i.e. a page that
has been removed as a page table) and page allocation. This sadly has
a negative performance impact on the pvshim, as fewer assisted flushes
can be used.

Introduce a new flag (FLUSH_FORCE_IPI) and helper to force a TLB flush
using an IPI (flush_tlb_mask_sync). Note that the flag is only
meaningfully defined when the hypervisor supports PV or shadow paging
mode, as otherwise hardware assisted paging domains are in charge of
their page tables and won't share page tables with Xen, thus not
influencing the result of page walks performed by the spurious fault
handler.

Just passing this new flag when calling flush_area_mask prevents the
usage of the assisted flush without any other side effects.

Note the flag is not defined on Arm, and the introduced helper is just
a dummy alias to the existing flush_tlb_mask.

Fixes: e9aca9470ed86 ('x86/tlb: use Xen L0 assisted TLB flush when available')
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Add a comment describing the usage of FLUSH_FORCE_IPI (and why no
   modifications to flush_area_mask are required).
 - Use PGT_root_page_table instead of PGT_l4_page_table.
 - Also perform IPI flushes if configured with shadow paging support.
 - Use ifdef instead of if.
---
 xen/arch/x86/mm.c              | 12 +++++++++++-
 xen/common/memory.c            |  2 +-
 xen/common/page_alloc.c        |  2 +-
 xen/include/asm-arm/flushtlb.h |  1 +
 xen/include/asm-x86/flushtlb.h | 18 ++++++++++++++++++
 xen/include/xen/mm.h           |  8 ++++++--
 6 files changed, 38 insertions(+), 5 deletions(-)
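
The flag-based selection described in the commit message can be
illustrated with a minimal sketch. This is not the code from the patch:
the flag values, the flush_method enum and the *_method helper names
are assumptions made up for this illustration. Only FLUSH_FORCE_IPI
and the idea of a flush_tlb_mask_sync helper come from the description
above.

```c
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define FLUSH_TLB        0x100u
#define FLUSH_FORCE_IPI  0x800u  /* stand-in for the flag named in the patch */

/* Hypothetical result type: how a remote flush would be delivered. */
enum flush_method { FLUSH_VIA_ASSIST, FLUSH_VIA_IPI };

/*
 * The assisted path is only eligible when the underlying hypervisor
 * offers it and the caller did not request an IPI.
 */
static enum flush_method pick_flush_method(unsigned int flags,
                                           bool assist_available)
{
    if ( assist_available && !(flags & FLUSH_FORCE_IPI) )
        return FLUSH_VIA_ASSIST;

    return FLUSH_VIA_IPI;
}

/* Models a plain flush_tlb_mask(): an assisted flush is allowed. */
static enum flush_method flush_tlb_mask_method(bool assist_available)
{
    return pick_flush_method(FLUSH_TLB, assist_available);
}

/* Models flush_tlb_mask_sync(): same flush, but an IPI is mandatory. */
static enum flush_method flush_tlb_mask_sync_method(bool assist_available)
{
    return pick_flush_method(FLUSH_TLB | FLUSH_FORCE_IPI, assist_available);
}
```

In this model the extra flag only removes the assisted path from
consideration. It does not change what gets flushed, which matches the
"without any other side effects" remark in the commit message.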

Julien Grall June 23, 2020, 3:08 p.m. UTC | #1

Hi Roger,

On 23/06/2020 15:50, Roger Pau Monne wrote:
> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> index 9b62087be1..f360166f00 100644
> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -639,7 +639,8 @@ static inline void accumulate_tlbflush(bool *need_tlbflush,
>       }
>   }
>   
> -static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
> +static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp,
> +                                           bool sync)

I read the commit message and went through the code, but it is still not 
clear what "sync" means in a non-architectural way.

As an Arm developer, I would assume this means we don't wait for the 
TLB flush to complete. But I am sure this would result in some 
unexpected behavior.

So can you explain in non-x86 terms what this really means?

Cheers,

Roger Pau Monné June 23, 2020, 3:15 p.m. UTC | #2

On Tue, Jun 23, 2020 at 04:08:06PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 23/06/2020 15:50, Roger Pau Monne wrote:
> > diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> > index 9b62087be1..f360166f00 100644
> > --- a/xen/include/xen/mm.h
> > +++ b/xen/include/xen/mm.h
> > @@ -639,7 +639,8 @@ static inline void accumulate_tlbflush(bool *need_tlbflush,
> >       }
> >   }
> > -static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
> > +static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp,
> > +                                           bool sync)
> 
> I read the commit message and went through the code, but it is still not
> clear what "sync" means in a non-architectural way.
> 
> As an Arm developper, I would assume this means we don't wait for the TLB
> flush to complete. But I am sure this would result to some unexpected
> behavior.

No, when we return from filtered_flush_tlb_mask the flush has been
performed (either with sync or without), but I understand the
confusion given the parameter name.

> So can you explain on non-x86 words what this really mean?

sync (in this context) means to force the usage of an IPI (if built
with PV or shadow paging support) in order to perform the flush.
AFAICT on Arm you always avoid an IPI when performing a flush, and
that's fine because you don't have PV or shadow, and then you don't
require this. Also I'm not sure Arm has the concept of a spurious
page fault.

I could rename to force_ipi (or require_ipi) if that's better?

Roger.
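
To make the meaning of the parameter concrete, here is a toy model of
a filtered flush. It is a sketch under stated assumptions, not the Xen
implementation: NR_CPUS, cpu_last_flush, filter_mask, filtered_flush
and the flush_kind enum are all hypothetical names. In both cases the
flush is complete on return; "sync" only restricts which delivery
method may be used.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4

/* Toy per-CPU state: time of each CPU's most recent TLB flush. */
static uint32_t cpu_last_flush[NR_CPUS];

/* Hypothetical outcome of a filtered flush request. */
enum flush_kind { FLUSH_NONE, FLUSH_ANY_METHOD, FLUSH_IPI_ONLY };

/* Drop CPUs that have already flushed after the given timestamp. */
static unsigned int filter_mask(unsigned int mask, uint32_t stamp)
{
    for ( unsigned int cpu = 0; cpu < NR_CPUS; cpu++ )
        if ( cpu_last_flush[cpu] > stamp )
            mask &= ~(1u << cpu);

    return mask;
}

/*
 * Report which CPUs still need a flush and how it may be delivered.
 * With sync set, only an IPI is acceptable; otherwise an assisted
 * flush may be used as well.
 */
static enum flush_kind filtered_flush(uint32_t stamp, bool sync,
                                      unsigned int *to_flush)
{
    *to_flush = filter_mask((1u << NR_CPUS) - 1, stamp);

    if ( !*to_flush )
        return FLUSH_NONE;

    return sync ? FLUSH_IPI_ONLY : FLUSH_ANY_METHOD;
}
```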

Julien Grall June 23, 2020, 3:46 p.m. UTC | #3

On 23/06/2020 16:15, Roger Pau Monné wrote:
> On Tue, Jun 23, 2020 at 04:08:06PM +0100, Julien Grall wrote:
>> Hi Roger,
>>
>> On 23/06/2020 15:50, Roger Pau Monne wrote:
>>> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
>>> index 9b62087be1..f360166f00 100644
>>> --- a/xen/include/xen/mm.h
>>> +++ b/xen/include/xen/mm.h
>>> @@ -639,7 +639,8 @@ static inline void accumulate_tlbflush(bool *need_tlbflush,
>>>        }
>>>    }
>>> -static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
>>> +static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp,
>>> +                                           bool sync)
>>
>> I read the commit message and went through the code, but it is still not
>> clear what "sync" means in a non-architectural way.
>>
>> As an Arm developper, I would assume this means we don't wait for the TLB
>> flush to complete. But I am sure this would result to some unexpected
>> behavior.
> 
> No, when we return from filtered_flush_tlb_mask the flush has been
> performed (either with sync or without), but I understand the
> confusion given the parameter name.
> 
>> So can you explain on non-x86 words what this really mean?
> 
> sync (in this context) means to force the usage of an IPI (if built
> with PV or shadow paging support) in order to perform the flush.

Compared to what?

> AFAICT on Arm you always avoid an IPI when performing a flush, and
> that's fine because you don't have PV or shadow, and then you don't
> require this.

Arm provides instructions to broadcast a TLB flush, so by the time one 
of these instructions has completed, none of the TLB entries associated 
with the translation exist any more.

I don't think using PV or shadow would change anything here in the way 
we do the flush.

> Also I'm not sure Arm has the concept of a spurious
> page fault.

So if I understand correctly, the HW may raise a fault even if the 
mapping is in the page tables. Is that correct?

We have a spurious page fault handler for stage-2 (aka EPT on x86), as 
we need to transition through an invalid mapping for certain page-table 
updates (e.g. superpage shattering). We use the same rwlock in the page 
fault handler and in the page-table update code, so there is no way the 
two can run concurrently.

> 
> I could rename to force_ipi (or require_ipi) if that's better?

Hmmm, based on what I wrote above, I don't think this name would be more 
suitable. However, regardless of the name, it is not clear to me when a 
caller should use false or true.

Have you considered a rwlock to synchronize the two?

Cheers,

Roger Pau Monné June 23, 2020, 4:16 p.m. UTC | #4

On Tue, Jun 23, 2020 at 04:46:29PM +0100, Julien Grall wrote:
> 
> 
> On 23/06/2020 16:15, Roger Pau Monné wrote:
> > On Tue, Jun 23, 2020 at 04:08:06PM +0100, Julien Grall wrote:
> > > Hi Roger,
> > > 
> > > On 23/06/2020 15:50, Roger Pau Monne wrote:
> > > > diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> > > > index 9b62087be1..f360166f00 100644
> > > > --- a/xen/include/xen/mm.h
> > > > +++ b/xen/include/xen/mm.h
> > > > @@ -639,7 +639,8 @@ static inline void accumulate_tlbflush(bool *need_tlbflush,
> > > >        }
> > > >    }
> > > > -static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
> > > > +static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp,
> > > > +                                           bool sync)
> > > 
> > > I read the commit message and went through the code, but it is still not
> > > clear what "sync" means in a non-architectural way.
> > > 
> > > As an Arm developper, I would assume this means we don't wait for the TLB
> > > flush to complete. But I am sure this would result to some unexpected
> > > behavior.
> > 
> > No, when we return from filtered_flush_tlb_mask the flush has been
> > performed (either with sync or without), but I understand the
> > confusion given the parameter name.
> > 
> > > So can you explain on non-x86 words what this really mean?
> > 
> > sync (in this context) means to force the usage of an IPI (if built
> > with PV or shadow paging support) in order to perform the flush.
> 
> This is compare to?

Using assisted flushes, like you do on Arm, where you don't send an
IPI in order to achieve a TLB flush on a remote pCPU.

> > AFAICT on Arm you always avoid an IPI when performing a flush, and
> > that's fine because you don't have PV or shadow, and then you don't
> > require this.
> 
> Arm provides instructions to broadcast TLB flush, so by the time one of
> instruction is completed there is all the TLB entry associated to the
> translation doesn't exist.
> 
> I don't think using PV or shadow would change anything here in the way we do
> the flush.

TBH, I'm not sure how this applies to Arm. There's no PV or shadow
implementation, so I have no idea whether this would apply or not.

> > Also I'm not sure Arm has the concept of a spurious
> > page fault.
> 
> So if I understand correctly, the HW may raise a fault even if the mapping
> was in the page-tables. Is it correct?

Yes, this can happen when you promote the permission of a page table
entry without doing a TLB flush AFAICT. I.e. you have a read-only page,
which is promoted to writable, but you don't perform a TLB flush and
just rely on getting a page fault that will clear the TLB entry and
retry.

> We have a spurious page fault handler for stage-2 (aka EPT on x86) as we
> need to have an invalid mapping to transition for certain page-tables update
> (e.g. superpage shattering). We are using the same rwlock with the page
> fault handler and the page-table update, so there is no way the two can run
> concurrently.

This is slightly different as it's used by PV page tables, so the
fault is triggered much more often than the fault handler you are
referring to IMO.

> > 
> > I could rename to force_ipi (or require_ipi) if that's better?
> 
> Hmmm, based on what I wrote above, I don't think this name would be more
> suitable. However, regardless the name, it is not clear to me when a caller
> should use false or true.
> 
> Have you considered a rwlock to synchronize the two?

Yes, the performance drop is huge when I tried. I could try to refine,
but I think there's always going to be a performance drop, as you then
require mutual exclusion when modifying the page tables (you take the
lock in write mode). Right now modification of the page tables can be
done concurrently.

FWIW Xen explicitly moved from using a lock into this model in
3203345bb13 apparently due to some deadlock situation. I'm not sure
if that still holds.

Roger.
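
The spurious fault case described above (a permission promotion that
is not followed by a TLB flush) can be sketched as a check against the
current page table entry. This is a single-level toy model:
classify_write_fault and the fault_type enum are hypothetical names,
and only the two low PTE bits follow the x86 layout (present and
writable).

```c
#include <stdint.h>

#define PTE_PRESENT 0x1u  /* x86 present bit */
#define PTE_WRITE   0x2u  /* x86 read/write bit */

/* Hypothetical classification of a write fault. */
enum fault_type { FAULT_REAL, FAULT_SPURIOUS };

/*
 * A write fault is spurious when the entry in memory already allows
 * the write: the CPU faulted on a stale, more restrictive TLB entry,
 * and retrying the access is enough.
 */
static enum fault_type classify_write_fault(uint32_t current_pte)
{
    if ( (current_pte & PTE_PRESENT) && (current_pte & PTE_WRITE) )
        return FAULT_SPURIOUS;

    return FAULT_REAL;
}
```

The real handler has to read an entry at every level of the walk, so
the page table pages must not be freed and re-used while it runs. Per
the commit message, that is what blocking interrupts (and therefore
flush IPIs) is relied on to guarantee.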

Julien Grall June 24, 2020, 11:10 a.m. UTC | #5

Hi Roger,

On 23/06/2020 17:16, Roger Pau Monné wrote:
> On Tue, Jun 23, 2020 at 04:46:29PM +0100, Julien Grall wrote:
>>
>>
>> On 23/06/2020 16:15, Roger Pau Monné wrote:
>>> On Tue, Jun 23, 2020 at 04:08:06PM +0100, Julien Grall wrote:
>>>> Hi Roger,
>>>>
>>>> On 23/06/2020 15:50, Roger Pau Monne wrote:
>>>>> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
>>>>> index 9b62087be1..f360166f00 100644
>>>>> --- a/xen/include/xen/mm.h
>>>>> +++ b/xen/include/xen/mm.h
>>>>> @@ -639,7 +639,8 @@ static inline void accumulate_tlbflush(bool *need_tlbflush,
>>>>>         }
>>>>>     }
>>>>> -static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
>>>>> +static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp,
>>>>> +                                           bool sync)
>>>>
>>>> I read the commit message and went through the code, but it is still not
>>>> clear what "sync" means in a non-architectural way.
>>>>
>>>> As an Arm developper, I would assume this means we don't wait for the TLB
>>>> flush to complete. But I am sure this would result to some unexpected
>>>> behavior.
>>>
>>> No, when we return from filtered_flush_tlb_mask the flush has been
>>> performed (either with sync or without), but I understand the
>>> confusion given the parameter name.
>>>
>>>> So can you explain on non-x86 words what this really mean?
>>>
>>> sync (in this context) means to force the usage of an IPI (if built
>>> with PV or shadow paging support) in order to perform the flush.
>>
>> This is compare to?
> 
> Using assisted flushes, like you do on Arm, where you don't send an
> IPI in order to achieve a TLB flush on a remote pCPU.

AFAICT, the assisted flushes only happen when running a nested Xen. Is 
that correct?

> 
>>> AFAICT on Arm you always avoid an IPI when performing a flush, and
>>> that's fine because you don't have PV or shadow, and then you don't
>>> require this.
>>
>> Arm provides instructions to broadcast TLB flush, so by the time one of
>> instruction is completed there is all the TLB entry associated to the
>> translation doesn't exist.
>>
>> I don't think using PV or shadow would change anything here in the way we do
>> the flush.
> 
> TBH, I'm not sure how this applies to Arm. There's no PV or shadow
> implementation, so I have no idea whether this would apply or not.

Yes there is none. However, my point was that if we had to implement 
PV/shadow on Arm then an IPI would definitely be my last choice.

>>>
>>> I could rename to force_ipi (or require_ipi) if that's better?
>>
>> Hmmm, based on what I wrote above, I don't think this name would be more
>> suitable. However, regardless the name, it is not clear to me when a caller
>> should use false or true.
>>
>> Have you considered a rwlock to synchronize the two?
> 
> Yes, the performance drop is huge when I tried. I could try to refine,
> but I think there's always going to be a performance drop, as you then
> require mutual exclusion when modifying the page tables (you take the
> lock in write mode). Right now modification of the page tables can be
> done concurrently.

Fair enough. I will scratch that suggestion then. Thanks for the 
explanation!

So now getting back to filtered_flush_tlb_mask(). AFAICT, its only two 
callers are in common code. Both are used for allocation purposes, so 
may I ask why they need to use different kinds of flush?

> 
> FWIW Xen explicitly moved from using a lock into this model in
> 3203345bb13 apparently due to some deadlock situation. I'm not sure
> if that still holds.

The old classic major change with limited explanation :/.

Cheers,

Roger Pau Monné June 25, 2020, 9:33 a.m. UTC | #6

On Wed, Jun 24, 2020 at 12:10:45PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 23/06/2020 17:16, Roger Pau Monné wrote:
> > On Tue, Jun 23, 2020 at 04:46:29PM +0100, Julien Grall wrote:
> > > 
> > > 
> > > On 23/06/2020 16:15, Roger Pau Monné wrote:
> > > > On Tue, Jun 23, 2020 at 04:08:06PM +0100, Julien Grall wrote:
> > > > > Hi Roger,
> > > > > 
> > > > > On 23/06/2020 15:50, Roger Pau Monne wrote:
> > > > > > diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> > > > > > index 9b62087be1..f360166f00 100644
> > > > > > --- a/xen/include/xen/mm.h
> > > > > > +++ b/xen/include/xen/mm.h
> > > > > > @@ -639,7 +639,8 @@ static inline void accumulate_tlbflush(bool *need_tlbflush,
> > > > > >         }
> > > > > >     }
> > > > > > -static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
> > > > > > +static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp,
> > > > > > +                                           bool sync)
> > > > > 
> > > > > I read the commit message and went through the code, but it is still not
> > > > > clear what "sync" means in a non-architectural way.
> > > > > 
> > > > > As an Arm developper, I would assume this means we don't wait for the TLB
> > > > > flush to complete. But I am sure this would result to some unexpected
> > > > > behavior.
> > > > 
> > > > No, when we return from filtered_flush_tlb_mask the flush has been
> > > > performed (either with sync or without), but I understand the
> > > > confusion given the parameter name.
> > > > 
> > > > > So can you explain on non-x86 words what this really mean?
> > > > 
> > > > sync (in this context) means to force the usage of an IPI (if built
> > > > with PV or shadow paging support) in order to perform the flush.
> > > 
> > > This is compare to?
> > 
> > Using assisted flushes, like you do on Arm, where you don't send an
> > IPI in order to achieve a TLB flush on a remote pCPU.
> 
> AFAICT, the assisted flushes only happen when running a nested Xen. Is that
> correct?

ATM yes, we don't have support for the newly introduced AMD INVLPGB
instruction yet, which provides such functionality on bare metal.

> > 
> > > > AFAICT on Arm you always avoid an IPI when performing a flush, and
> > > > that's fine because you don't have PV or shadow, and then you don't
> > > > require this.
> > > 
> > > Arm provides instructions to broadcast TLB flush, so by the time one of
> > > instruction is completed there is all the TLB entry associated to the
> > > translation doesn't exist.
> > > 
> > > I don't think using PV or shadow would change anything here in the way we do
> > > the flush.
> > 
> > TBH, I'm not sure how this applies to Arm. There's no PV or shadow
> > implementation, so I have no idea whether this would apply or not.
> 
> Yes there is none. However, my point was that if we had to implement
> PV/shadow on Arm then an IPI would definitely be my last choice.

Right, this mostly depends on how you perform page table modifications
and whether you have to handle spurious faults like x86 does.

> > > > 
> > > > I could rename to force_ipi (or require_ipi) if that's better?
> > > 
> > > Hmmm, based on what I wrote above, I don't think this name would be more
> > > suitable. However, regardless the name, it is not clear to me when a caller
> > > should use false or true.
> > > 
> > > Have you considered a rwlock to synchronize the two?
> > 
> > Yes, the performance drop is huge when I tried. I could try to refine,
> > but I think there's always going to be a performance drop, as you then
> > require mutual exclusion when modifying the page tables (you take the
> > lock in write mode). Right now modification of the page tables can be
> > done concurrently.
> 
> Fair enough. I will scratch that suggestion then. Thanks for the
> explanation!
> 
> So now getting back to filtered_flush_tlb(). AFAICT, the only two callers
> are in common code. The two are used for allocation purpose, so may I ask
> why they need to use different kind of flush?

Looking at it again, this is wrong. I've just realized that
populate_physmap will, depending on the situation, use the
MEMF_no_tlbflush flag, and so it needs to perform the flush by itself
(and that's why filtered_flush_tlb_mask is used).

I guess you will be fine with removing the sync parameter then, and on
x86 force filtered_flush_tlb_mask to always use physical IPIs in
order to perform the flush?

Thanks, Roger.
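
The deferred flush pattern mentioned for populate_physmap (allocate
with MEMF_no_tlbflush, then flush once afterwards) can be sketched as
follows. This is a simplified model and not the Xen code: struct
toy_page, accumulate and collect_flush are hypothetical, loosely
modelled on the accumulate_tlbflush helper visible in the quoted diff
context.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the per-page flush bookkeeping. */
struct toy_page {
    bool need_tlbflush;           /* page may still be cached in a TLB */
    uint32_t tlbflush_timestamp;  /* time the page was last in use */
};

/* Remember the newest timestamp among pages that need a flush. */
static void accumulate(bool *need, const struct toy_page *pg,
                       uint32_t *stamp)
{
    if ( pg->need_tlbflush &&
         (!*need || pg->tlbflush_timestamp > *stamp) )
    {
        *need = true;
        *stamp = pg->tlbflush_timestamp;
    }
}

/*
 * Walk a batch of freshly allocated pages and report whether a single
 * flush is needed afterwards, and for which timestamp.
 */
static bool collect_flush(const struct toy_page *pages, unsigned int nr,
                          uint32_t *stamp)
{
    bool need = false;

    for ( unsigned int i = 0; i < nr; i++ )
        accumulate(&need, &pages[i], stamp);

    return need;
}
```

A caller would then issue one filtered flush with the collected
timestamp instead of one flush per page. Since that flush stands in
for the ones skipped at allocation time, it would need the same
IPI-only treatment on x86.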
