[3/3] riscv: rewrite tlb flush for performance improvement

Message ID	LO2P265MB08471CC8597FD60691334624D64C0@LO2P265MB0847.GBRP265.PROD.OUTLOOK.COM (mailing list archive)
State	New, archived
From	Gary Guo <gary@garyguo.net>
To	linux-riscv@lists.infradead.org
Cc	Palmer Dabbelt <palmer@sifive.com>, Albert Ou <aou@eecs.berkeley.edu>
Subject	[PATCH 3/3] riscv: rewrite tlb flush for performance improvement
Date	Thu, 7 Mar 2019 01:29:12 +0000
Series	[1/3] riscv: move switch_mm to its own file
	[2/3] riscv: fix SBI call of sbi_remote_sfence_vma{,_asid}.
	[3/3] riscv: rewrite tlb flush for performance improvement

Gary Guo March 7, 2019, 1:29 a.m. UTC

This patch rewrites the logic related to TLB flushing, both to clean up
the code and to improve performance.

We now use the sfence.vma variant with a specified ASID and virtual address
whenever possible.  Even though only ASID 0 is used, it still improves
performance by preventing global mappings from being flushed from the TLB.
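
The effect described above can be illustrated with a toy user-space model
(not kernel code).  It is only a sketch: the names toy_tlb_entry,
toy_flush_all, toy_flush_asid and toy_count_valid are made up here, and the
one behaviour it assumes is that an ASID-qualified flush leaves global
entries alone while an unqualified flush drops everything.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: a real TLB is hardware, this just mimics the bookkeeping. */
struct toy_tlb_entry {
	unsigned long vpn;	/* virtual page number */
	unsigned long asid;	/* address space ID (always 0 in this patch) */
	bool global;		/* global mapping, e.g. kernel text */
	bool valid;
};

/* Unqualified flush: every entry is dropped, global or not. */
static void toy_flush_all(struct toy_tlb_entry *tlb, size_t n)
{
	for (size_t i = 0; i < n; i++)
		tlb[i].valid = false;
}

/* ASID-qualified flush: only non-global entries of that ASID are dropped. */
static void toy_flush_asid(struct toy_tlb_entry *tlb, size_t n,
			   unsigned long asid)
{
	for (size_t i = 0; i < n; i++)
		if (!tlb[i].global && tlb[i].asid == asid)
			tlb[i].valid = false;
}

/* Count the entries that survived a flush. */
static size_t toy_count_valid(const struct toy_tlb_entry *tlb, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (tlb[i].valid)
			count++;
	return count;
}
```

With two user entries and one global entry, toy_flush_asid() on ASID 0
leaves the global entry in place, whereas toy_flush_all() removes it too.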

This patch also includes an IPI-based remote TLB shootdown, which is useful
at this stage for testing because BBL/OpenSBI ignores the operands of
sbi_remote_sfence_vma_asid and always performs a global TLB flush.
The IPI-based remote TLB shootdown is gated behind the RISCV_TLBI_IPI config
option and is off by default.

Signed-off-by: Xuan Guo <gary@garyguo.net>
---
 arch/riscv/Kconfig                |  40 +++++++++
 arch/riscv/include/asm/pgtable.h  |   2 +-
 arch/riscv/include/asm/tlbflush.h |  82 +++++++++--------
 arch/riscv/mm/Makefile            |   2 +
 arch/riscv/mm/context.c           |   8 +-
 arch/riscv/mm/tlbflush.c          | 144 ++++++++++++++++++++++++++++++
 6 files changed, 239 insertions(+), 39 deletions(-)
 create mode 100644 arch/riscv/mm/tlbflush.c

Christoph Hellwig March 8, 2019, 2:39 p.m. UTC | #1

> +menu "Virtual memory management"
> +
> +config RISCV_TLBI_IPI
> +	bool "Use IPI instead of SBI for remote TLB shootdown"
> +	default n
> +	help
> +	  Instead of using remote TLB shootdown interfaces provided by SBI,
> +	  use IPI to handle remote TLB shootdown within Linux kernel.
> +
> +	  BBL/OpenSBI are currently ignoring ASID and address range provided
> +	  by SBI call argument, and do a full TLB flush instead. This may
> +	  negatively impact performance on implementations with page-level
> +	  sfence.vma support.
> +
> +	  If you don't know what to do here, say N.

Requiring a kconfig here is rather sad.  For now I would just
switch entirely to your non-SBI version as doing SBI calls for this
is rather pointless to start with.  Either we get real architectural
hardware acceleration, or we might as well use IPIs ourselves.

That being said if there are strong arguments to keep the old code
I'd still prefer that to be runtime selectable.

> +
> +config RISCV_TLBI_MAX_OPS
> +	int "Max number of page-level sfence.vma per range TLB flush"
> +	range 1 511
> +	default 1
> +	help
> +	  This config specifies how many page-level sfence.vma can the Linux
> +	  kernel issue when the kernel needs to flush a range from the TLB.
> +	  If the required number of page-level sfence.vma exceeds this limit,
> +	  a full sfence.vma is issued.
> +
> +	  Increase this number can negatively impact performance on
> +	  implemntations where sfence.vma's address operand is ignored and
> +	  always perform a global TLB flush.
> +
> +	  On the other hand, implementations with page-level TLB flush support
> +	  can benefit from a larger number.
> +
> +	  If you don't know what to do here, keep the default value 1.

Again, I don't think hardcoding this makes any sense.  For the setting
to be useful it needs to be overridable, and preferably provided by the
SBI code in some form (DT entry?).
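
As a rough user-space sketch of what an overridable limit could look like:
the compile-time constant becomes a variable, and the range flush decides
at runtime between page-level flushes and a full flush.  The names
tlbi_max_ops, sketch_pages_in_range and sketch_needs_full_flush are
hypothetical, and 4 KiB pages are assumed.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Hypothetical runtime tunable standing in for CONFIG_RISCV_TLBI_MAX_OPS. */
static unsigned long tlbi_max_ops = 1;

/* Number of pages touched by the half-open range [start, end). */
static unsigned long sketch_pages_in_range(unsigned long start,
					   unsigned long end)
{
	if (end <= start)
		return 0;
	return ((end - 1) >> SKETCH_PAGE_SHIFT) -
	       (start >> SKETCH_PAGE_SHIFT) + 1;
}

/* True when the range needs more page-level flushes than the limit allows. */
static bool sketch_needs_full_flush(unsigned long start, unsigned long end)
{
	return sketch_pages_in_range(start, end) > tlbi_max_ops;
}
```

Because the limit is a plain variable, whatever source ends up providing it
(boot option, DT, SBI) only has to assign tlbi_max_ops once during boot.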

> index 54fee0cadb1e..f254237a3bda 100644
> --- a/arch/riscv/include/asm/tlbflush.h
> +++ b/arch/riscv/include/asm/tlbflush.h
> @@ -1,6 +1,5 @@
>  /*
> - * Copyright (C) 2009 Chen Liqin <liqin.chen@sunplusct.com>
> - * Copyright (C) 2012 Regents of the University of California
> + * Copyright (C) 2019 Gary Guo, University of Cambridge

Unless you completely rewrite the file it is rather rude to remove
the existing copyright.  There still seems to be a decent amount
of at least comments left from the old codebase.

> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
> +		unsigned long addr)
> +{
> +	__asm__ __volatile__ ("sfence.vma %0, %1" : : "r" (addr), "r" (0) : "memory");
> +}

Please avoid lines over 80 chars.  Also I find inline assembly much
easier to read if each argument has its own line.

> +#ifndef CONFIG_SMP

If you rewrite this anyway can you switch the order around and use
an ifdef instead of ifndef?

> +extern void flush_tlb_all(void);
> +extern void flush_tlb_mm(struct mm_struct *mm);
> +extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr);
> +extern void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
> +		unsigned long end);
> +extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);

No need for all the externs.

> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> new file mode 100644
> index 000000000000..76cea33aa9c7
> --- /dev/null
> +++ b/arch/riscv/mm/tlbflush.c
> @@ -0,0 +1,144 @@
> +/*
> + * Copyright (C) 2019 Gary Guo, University of Cambridge
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */

Please use an SPDX tag instead of the license boilerplate.
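
For reference, a sketch of what the top of the new .c file could look like
with an SPDX tag.  The identifier assumes the boilerplate above means GPL
version 2 only; the copyright line is copied from the patch.

```c
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2019 Gary Guo, University of Cambridge
 */
```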

> +#include <linux/mm.h>
> +#include <asm/sbi.h>
> +
> +/*
> + * BBL/OpenSBI are currently ignoring ASID and address range provided
> + * by SBI call argument, and do a full TLB flush instead.

This is really information for the changelog, not really for the code.

> +static void ipi_remote_sfence_vma(void *info)
> +{
> +	struct tlbi *data = (struct tlbi *) info;

No need for the cast.
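
In C a void pointer converts implicitly to any object pointer type, so the
assignment works without the cast.  A small stand-alone sketch; struct
tlbi_sketch and its fields are assumptions, not the layout of the patch's
struct tlbi:

```c
/* Hypothetical stand-in for the patch's struct tlbi; fields are assumed. */
struct tlbi_sketch {
	unsigned long start;
	unsigned long size;
	unsigned long end;	/* filled in by the callback */
};

/* Callback with the void * signature used by IPI-style helpers. */
static void ipi_callback_sketch(void *info)
{
	struct tlbi_sketch *data = info;	/* implicit conversion, no cast */

	data->end = data->start + data->size;
}
```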

> +	if (size == (unsigned long) -1) {

Maybe add a symbolic RISCV_FLUSH_ALL macro for the magic -1?

> +void flush_tlb_page(struct vm_area_struct *vma,
> +		unsigned long addr)

The second argument easily fits onto the first line.

> +void flush_tlb_range(struct vm_area_struct *vma,
> +		unsigned long start, unsigned long end)

Same here.

Gary Guo March 8, 2019, 3:55 p.m. UTC | #2

On 08/03/2019 14:39, Christoph Hellwig wrote:
>> +menu "Virtual memory management"
>> +
>> +config RISCV_TLBI_IPI
>> +	bool "Use IPI instead of SBI for remote TLB shootdown"
>> +	default n
>> +	help
>> +	  Instead of using remote TLB shootdown interfaces provided by SBI,
>> +	  use IPI to handle remote TLB shootdown within Linux kernel.
>> +
>> +	  BBL/OpenSBI are currently ignoring ASID and address range provided
>> +	  by SBI call argument, and do a full TLB flush instead. This may
>> +	  negatively impact performance on implementations with page-level
>> +	  sfence.vma support.
>> +
>> +	  If you don't know what to do here, say N.
> 
> Requiring a kconfig here is rather sad.  For now I would just
> switch entirely to your non-SBI version as doing SBI calls for this
> is rather pointless to start with.  Either we get real architectural
> hardware acceleration, or we might as well use IPIs ourselves.
> 

The problem is that technically SBI should hide the details, and we 
should automatically, and transparently benefit from architectural 
hardware acceleration. The only issue is that the SBI isn't doing it 
well at the moment, which to me looks more like an erratum. Once SBI is 
doing its job, we should switch back and use SBI instead. We can change 
the default value to Y first, but I do believe we should still be able to 
use the SBI version if necessary, e.g. to test an SBI implementation.

> That being said if there are strong arguments to keep the old code
> I'd still prefer that to be runtime selectable.
> 
>> +
>> +config RISCV_TLBI_MAX_OPS
>> +	int "Max number of page-level sfence.vma per range TLB flush"
>> +	range 1 511
>> +	default 1
>> +	help
>> +	  This config specifies how many page-level sfence.vma can the Linux
>> +	  kernel issue when the kernel needs to flush a range from the TLB.
>> +	  If the required number of page-level sfence.vma exceeds this limit,
>> +	  a full sfence.vma is issued.
>> +
>> +	  Increase this number can negatively impact performance on
>> +	  implemntations where sfence.vma's address operand is ignored and
>> +	  always perform a global TLB flush.
>> +
>> +	  On the other hand, implementations with page-level TLB flush support
>> +	  can benefit from a larger number.
>> +
>> +	  If you don't know what to do here, keep the default value 1.
> 
> Again, I don't think hardcoding this makes any sense.  To make the
> setting it needs to be overridable, and preferably provided by the
> SBI code in some form (DT entry?).
> 

Currently we have no way to retrieve the ideal size, so I just provide 
a hardcoded constant. What do you suggest for the short term, before we 
have a reliable way to retrieve this value?

>> +#ifndef CONFIG_SMP
> 
> If you rewrite this anyway can you switch the order around and use
> an ifdef instead of ifndef?

I did a quick grep on other archs, and it seems that at least for 
CONFIG_SMP, it's more common to put the non-SMP part first.

>> +#include <linux/mm.h>
>> +#include <asm/sbi.h>
>> +
>> +/*
>> + * BBL/OpenSBI are currently ignoring ASID and address range provided
>> + * by SBI call argument, and do a full TLB flush instead.
> 
> This is really information for the changelog, not really for the code.
> 
>> +static void ipi_remote_sfence_vma(void *info)
>> +{
>> +	struct tlbi *data = (struct tlbi *) info;
> 
> No need for the cast.

Thanks for pointing that out. After writing C++ for too long, the cast 
has become muscle memory :D

>> +	if (size == (unsigned long) -1) {
> 
> Maybe add a symbolic RISCV_FLUSH_ALL macros for the magic -1?

This is just to unify with SBI (currently we also use -1 in SBI calls). 
Since the argument is unsigned long, the range from 0 to -1 basically just 
represents the entire address space. Do you think it's better to replace 
the -1 argument in all SBI calls with a macro as well?

Christoph Hellwig March 8, 2019, 4:31 p.m. UTC | #3

On Fri, Mar 08, 2019 at 03:55:43PM +0000, Gary Guo wrote:
> The problem is that technically SBI should hide the details, and we 
> should automatically, and transparently benefit from architectural 
> hardware acceleration.

But it doesn't.  The SBI effectively requires an additional trap,
which makes the whole thing a little pointless.  I'd much rather
have well-working code in the kernel, and if people want to do
hardware-based TLB flushing, they can propose a well-defined
Supervisor ABI extension for it.

> Currently we have no way to retrieve the ideal size - so I just provide 
> a hardcoded constant. What do you suggest for short-term before we have 
> a reliable way to retrieve this value?

Kernel boot option.
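
In the kernel such an option would normally be registered with early_param()
or __setup().  The user-space sketch below only shows the parsing and range
check; the option name riscv_tlbi_max_ops and the helper
sketch_parse_max_ops are hypothetical, and the 1..511 bounds are taken from
the Kconfig range in the patch.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PARAM "riscv_tlbi_max_ops="	/* hypothetical option name */

/*
 * Look for "riscv_tlbi_max_ops=<n>" in a space-separated command line.
 * Returns the parsed value when it lies in [1, 511], otherwise the
 * supplied default.
 */
static unsigned long sketch_parse_max_ops(const char *cmdline,
					  unsigned long def)
{
	const char *p = cmdline;
	size_t len = strlen(SKETCH_PARAM);

	while ((p = strstr(p, SKETCH_PARAM)) != NULL) {
		/* Only accept a match at the start of a token. */
		if (p == cmdline || p[-1] == ' ') {
			char *end;
			unsigned long val;

			if (!isdigit((unsigned char)p[len]))
				return def;
			errno = 0;
			val = strtoul(p + len, &end, 10);
			if (errno == 0 && (*end == '\0' || *end == ' ') &&
			    val >= 1 && val <= 511)
				return val;
			return def;
		}
		p += len;
	}
	return def;
}
```

Falling back to the default on any malformed or out-of-range value keeps the
behaviour identical to the current hardcoded setting when the option is
absent.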

> >> +#ifndef CONFIG_SMP
> > 
> > If you rewrite this anyway can you switch the order around and use
> > an ifdef instead of ifndef?
> 
> I did a fast grep on other archs, and it seems that at least for 
> CONFIG_SMP, it's more common to put non-SMP part first.

Most of that probably is copy and paste from x86 which had the
non-SMP version only first.  Still not a good idea.

> >> +	if (size == (unsigned long) -1) {
> > 
> > Maybe add a symbolic RISCV_FLUSH_ALL macros for the magic -1?
> 
> This is just to unify with SBI (currently we also use -1 in SBI-calls). 
> Since the argument is unsigned long, from 0 to -1 basically just 
> represent the entire address space. Do you think it's better to replace 
> all SBI with -1 argument with a macro as well?

Note that I'm fine keeping the (unsigned long)-1 as the actual ABI, it
would just be nice to give it a symbolic name.

Anup Patel March 8, 2019, 4:46 p.m. UTC | #4

On Fri, Mar 8, 2019 at 9:25 PM Gary Guo <gary@garyguo.net> wrote:
>
> On 08/03/2019 14:39, Christoph Hellwig wrote:
> >> +menu "Virtual memory management"
> >> +
> >> +config RISCV_TLBI_IPI
> >> +    bool "Use IPI instead of SBI for remote TLB shootdown"
> >> +    default n
> >> +    help
> >> +      Instead of using remote TLB shootdown interfaces provided by SBI,
> >> +      use IPI to handle remote TLB shootdown within Linux kernel.
> >> +
> >> +      BBL/OpenSBI are currently ignoring ASID and address range provided
> >> +      by SBI call argument, and do a full TLB flush instead. This may
> >> +      negatively impact performance on implementations with page-level
> >> +      sfence.vma support.
> >> +
> >> +      If you don't know what to do here, say N.
> >
> > Requiring a kconfig here is rather sad.  For now I would just
> > switch entirely to your non-SBI version as doing SBI calls for this
> > is rather pointless to start with.  Either we get real architectural
> > hardware acceleration, or we might as well use IPIs ourselves.
> >
>
> The problem is that technically SBI should hide the details, and we
> should automatically, and transparently benefit from architectural
> hardware acceleration. The only issue is that the SBI isn't doing it
> well at the moment, which to me looks more like an erratum. Once SBI is
> doing its job, we should switch back and use SBI instead. We can change
> to default value to Y first, but I do believe we should still be able to
> use SBI-version if necessary, e.g. to test SBI implementation.

Currently OpenSBI is in feature parity with BBL and now we are trying
to do things in a better way. We will certainly implement
SBI_REMOTE_SFENCE_VMA and SBI_REMOTE_SFENCE_VMA_ASID
in a better way instead of flushing everything.

Here's the link to track OpenSBI work:
https://github.com/riscv/opensbi/issues/87

I agree with you that SBI calls should hide the details of remote SFENCE
so that in future some platform can have a platform-specific way to
accelerate remote SFENCE which is abstracted by SBI calls.

There is one more advantage in using SBI calls (apart from the above)
for virtualization. Let's say we have a Guest/VM with more than one
VCPU. It is possible that Guest VCPUs run on the same host CPU, in
which case the hypervisor can avoid unnecessary IPIs if Guest Linux
is using SBI calls for remote SFENCE. If Guest Linux instead uses its
own IPI-based implementation, it will do redundant IPIs when
Guest VCPUs are running on the same host CPU.

In summary, we should certainly use SBI calls for remote SFENCE to:
1. Allow SBI calls to hide platform specific acceleration
2. Be more virtualization friendly allowing hypervisor to optimize IPIs

>
>
> > That being said if there are strong arguments to keep the old code
> > I'd still prefer that to be runtime selectable.

We should certainly keep it runtime selectable.

Regards,
Anup

Anup Patel March 8, 2019, 4:50 p.m. UTC | #5

On Fri, Mar 8, 2019 at 10:01 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Mar 08, 2019 at 03:55:43PM +0000, Gary Guo wrote:
> > The problem is that technically SBI should hide the details, and we
> > should automatically, and transparently benefit from architectural
> > hardware acceleration.
>
> But it doesn't.  The SBI effectively requires an additional trap,
> which makes the whole thing a little pointless.  I'd much rather
> have well working code in the kernel, and if people want to do
> hardware based TLB flushing to propose a well defined Supervisor
> ABI extension for it.

For your information, the IPI triggering is also done using an SBI call,
so we are not saving much by not using SBI calls for remote SFENCE.

If a platform implements remote TLB flushing in HW then OpenSBI
can use that HW mechanism instead of IPIs. For Linux, the platform
remote TLB flush acceleration will be abstracted by SBI calls.

Regards,
Anup

Christoph Hellwig March 8, 2019, 5:18 p.m. UTC | #6

On Fri, Mar 08, 2019 at 10:20:49PM +0530, Anup Patel wrote:
> For your information, the IPI triggering is also done using SBI call
> so we are not saving much by not using SBI calls for remote SFENCE.

Not yet.  But all the usual implementations just implement IPIs by
MMIO writes to the CLINT, which could easily be delegated, and we'd
save the trap.

Anup Patel March 10, 2019, 8:46 p.m. UTC | #7

On Fri, Mar 8, 2019 at 10:48 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Mar 08, 2019 at 10:20:49PM +0530, Anup Patel wrote:
> > For your information, the IPI triggering is also done using SBI call
> > so we are not saving much by not using SBI calls for remote SFENCE.
>
> Not yet.  But all the usual implementation just implement IPIs by
> MMIO writes to the CLINT.  Which could easily be delegated and we'd
> save the trap.

True, but CLINT is not the de facto way of triggering IPIs. Other SOCs will
have their own way of triggering IPIs. That's why it is abstracted via
an SBI call.

Regards,
Anup

Anup Patel March 10, 2019, 8:49 p.m. UTC | #8

On Mon, Mar 11, 2019 at 2:16 AM Anup Patel <anup@brainfault.org> wrote:
>
> On Fri, Mar 8, 2019 at 10:48 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Fri, Mar 08, 2019 at 10:20:49PM +0530, Anup Patel wrote:
> > > For your information, the IPI triggering is also done using SBI call
> > > so we are not saving much by not using SBI calls for remote SFENCE.
> >
> > Not yet.  But all the usual implementation just implement IPIs by
> > MMIO writes to the CLINT.  Which could easily be delegated and we'd
> > save the trap.
>
> True but CLINT is not the defacto way triggering IPIs. Other SOCs will
> have their own way of triggering IPIs. That's why is it is abstracted via
> SBI call.

There are also use-cases to run two separate OSes on the same SOC
without virtualization. In such cases, both OSes cannot have access
to the CLINT for triggering IPIs.

Regards,
Anup

Christoph Hellwig March 11, 2019, 3:53 p.m. UTC | #9

On Mon, Mar 11, 2019 at 02:19:54AM +0530, Anup Patel wrote:
> There are also use-cases to run two separate OSes on same SOC
> without virtualization. In such cases, both OSes cannot have access
> to CLINT for triggering IPIs.

And sometimes pigs can fly (no, really!).  But we should optimize
for performance on the hardware we have, not build ivory towers
for grand architectures of the future.

And with that I don't mean cutting corners and weird micro-optimizations,
but thinking hard about what layering makes sense.

Having to trap into machine mode software to flush TLBs does not make
sense in any normal architecture.

Anup Patel March 11, 2019, 4:28 p.m. UTC | #10

On Mon, Mar 11, 2019 at 9:23 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Mar 11, 2019 at 02:19:54AM +0530, Anup Patel wrote:
> > There are also use-cases to run two separate OSes on same SOC
> > without virtualization. In such cases, both OSes cannot have access
> > to CLINT for triggering IPIs.
>
> And sometimes pigs can fly (no, really!).  But we should optimize
> for performance on hardware we have not build some ivory towers
> for grand architectures of the future.

IMHO, we should not restrict the use-cases of the Linux RISC-V kernel. The
Linux RISC-V kernel should have a mechanism to do remote TLB flushes
using both SBI calls and its own IPIs. We should let users
decide what mechanism of remote TLB flushes they want.
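
One way to picture a runtime choice between the two mechanisms is a function
pointer that is set once at boot.  This is only a user-space sketch: every
name in it is hypothetical, and the two back ends merely count calls where a
kernel would issue the SBI call or send the IPIs.

```c
/* Two hypothetical back ends; the bodies only record which path ran. */
enum sketch_backend { SKETCH_USE_SBI, SKETCH_USE_IPI };

static int sketch_sbi_calls;
static int sketch_ipi_calls;

static void sketch_flush_via_sbi(unsigned long start, unsigned long size)
{
	(void)start;
	(void)size;
	sketch_sbi_calls++;	/* a kernel would issue the SBI call here */
}

static void sketch_flush_via_ipi(unsigned long start, unsigned long size)
{
	(void)start;
	(void)size;
	sketch_ipi_calls++;	/* a kernel would send IPIs here */
}

/* Selected once, e.g. from a boot option; defaults to the SBI path. */
static void (*sketch_remote_flush)(unsigned long, unsigned long) =
	sketch_flush_via_sbi;

static void sketch_select_backend(enum sketch_backend backend)
{
	sketch_remote_flush = (backend == SKETCH_USE_IPI) ?
		sketch_flush_via_ipi : sketch_flush_via_sbi;
}
```

Callers always go through sketch_remote_flush(), so switching mechanisms
does not touch the flush_tlb_*() call sites.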

>
> And with that I don't mean cutting corners and weird micro-optimization,
> but to think hard what layering makes sense.
>
> Having to trap into machine mode software to flush TLBs does not make
> sense in any normal architecture.

That's your view of a "normal architecture". How I see this is that RISC-V
is allowing CPU designers to implement their own way of remote TLB flush,
which is a very unique thing.

Regards,
Anup
