
[RFC,0/6] x86: prefetch_page() vDSO call

Message ID 20210225072910.2811795-1-namit@vmware.com (mailing list archive)

Message

Nadav Amit Feb. 25, 2021, 7:29 a.m. UTC
From: Nadav Amit <namit@vmware.com>

Just as applications can use prefetch instructions to overlap
computations and memory accesses, applications may want to overlap the
page-faults and compute or overlap the I/O accesses that are required
for page-faults of different pages.

Applications can use multiple threads and cores for this matter, by
running one thread that prefetches the data (i.e., faults in the data)
and another that does the compute, but this scheme is inefficient.
mincore() can tell whether a page is mapped, but it might not tell whether
the page is in the page-cache, and it does not fault the data in.

Introduce a prefetch_page() vDSO call that prefetches, i.e. faults in,
memory asynchronously. The semantics of this call are: try to prefetch
the page at a given address and return zero if the page is accessible
following the call. Start I/O operations to retrieve the page if such
operations are required and there is no high memory pressure that might
introduce slowdowns.

Note that as usual the page might be paged-out at any point and
therefore, similarly to mincore(), there is no guarantee that the page
will be present at the time that the user application uses the data that
resides on the page. Nevertheless, it is expected that in the vast
majority of the cases this would not happen, since prefetch_page()
accesses the page and therefore sets the PTE access-bit (if it is
clear). 
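To make the intended usage concrete, a minimal user-space sketch follows
(this is not part of the series; the prototype and the way the symbol is
resolved from the vDSO are assumptions for illustration only):

#include <stddef.h>

/*
 * Hypothetical prototype: returns zero if the page backing @addr is
 * accessible after the call. The function pointer is presumed to have
 * been resolved from the vDSO, e.g. with the parse_vdso.c helpers used
 * by the selftests.
 */
typedef long (*prefetch_page_fn)(const void *addr);

/*
 * Sum one byte per page while prefetching a fixed distance ahead, so
 * that page-faults/I/O overlap with the computation.
 */
static long sum_first_bytes(const char *buf, size_t npages,
			    size_t page_size, prefetch_page_fn prefetch_page)
{
	const size_t ahead = 8;	/* arbitrary prefetch distance */
	long sum = 0;
	size_t i;

	for (i = 0; i < npages; i++) {
		if (i + ahead < npages)
			prefetch_page(buf + (i + ahead) * page_size);
		sum += buf[i * page_size];
	}
	return sum;
}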

The implementation is as follows. The vDSO code accesses the data,
triggering a page-fault if it is not present. The handler detects, based
on the instruction pointer and the recently introduced vDSO exception
tables, that this is an asynchronous #PF. If the page can be brought in
without waiting (e.g., the page is already in the page-cache), the
kernel handles the fault and returns success (zero). If there is memory
pressure that prevents proper handling of the fault (i.e., it requires
heavy-weight reclamation), the kernel returns failure. Otherwise, it
starts an I/O to bring in the page and returns failure.
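Roughly, the decision in the fault path follows the sketch below. This is
only an approximation of the description above, not a copy of the patches
(the helper name is made up and the extable fixup that returns the value
to user-space is omitted); FAULT_FLAG_RETRY_NOWAIT, VM_FAULT_RETRY and
handle_mm_fault() are the existing pieces the series builds on:

/* Illustrative only -- not the actual patch code. */
static long vdso_prefetch_fault(struct vm_area_struct *vma,
				unsigned long addr, struct pt_regs *regs)
{
	/* Never sleep, neither on I/O nor on heavy-weight reclaim. */
	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
	vm_fault_t ret = handle_mm_fault(vma, addr, flags, regs);

	/*
	 * VM_FAULT_RETRY: the page was not immediately available. Any
	 * readahead/swap-in I/O has already been kicked off, but the
	 * vDSO call reports failure so the caller can come back later.
	 */
	return (ret & (VM_FAULT_RETRY | VM_FAULT_ERROR)) ? -1 : 0;
}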

Compilers can be extended to issue the prefetch_page() calls when
needed.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: x86@kernel.org

Nadav Amit (6):
  vdso/extable: fix calculation of base
  x86/vdso: add mask and flags to extable
  x86/vdso: introduce page_prefetch()
  mm/swap_state: respect FAULT_FLAG_RETRY_NOWAIT
  mm: use lightweight reclaim on FAULT_FLAG_RETRY_NOWAIT
  testing/selftest: test vDSO prefetch_page()

 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/vdso/Makefile                  |   1 +
 arch/x86/entry/vdso/extable.c                 |  70 +++--
 arch/x86/entry/vdso/extable.h                 |  21 +-
 arch/x86/entry/vdso/vdso.lds.S                |   1 +
 arch/x86/entry/vdso/vprefetch.S               |  39 +++
 arch/x86/entry/vdso/vsgx.S                    |   9 +-
 arch/x86/include/asm/vdso.h                   |  38 ++-
 arch/x86/mm/fault.c                           |  11 +-
 lib/vdso/Kconfig                              |   5 +
 mm/memory.c                                   |  47 +++-
 mm/shmem.c                                    |   1 +
 mm/swap_state.c                               |  12 +-
 tools/testing/selftests/vDSO/Makefile         |   2 +
 .../selftests/vDSO/vdso_test_prefetch_page.c  | 265 ++++++++++++++++++
 15 files changed, 470 insertions(+), 53 deletions(-)
 create mode 100644 arch/x86/entry/vdso/vprefetch.S
 create mode 100644 tools/testing/selftests/vDSO/vdso_test_prefetch_page.c

Comments

Peter Zijlstra Feb. 25, 2021, 8:40 a.m. UTC | #1
On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Just as applications can use prefetch instructions to overlap
> computations and memory accesses, applications may want to overlap the
> page-faults and compute or overlap the I/O accesses that are required
> for page-faults of different pages.
> 
> Applications can use multiple threads and cores for this matter, by
> running one thread that prefetches the data (i.e., faults in the data)
> and another that does the compute, but this scheme is inefficient. Using
> mincore() can tell whether a page is mapped, but might not tell whether
> the page is in the page-cache and does not fault in the data.
> 
> Introduce a prefetch_page() vDSO call that prefetches, i.e. faults in,
> memory asynchronously. The semantics of this call are: try to prefetch
> the page at a given address and return zero if the page is accessible
> following the call. Start I/O operations to retrieve the page if such
> operations are required and there is no high memory pressure that might
> introduce slowdowns.
> 
> Note that as usual the page might be paged-out at any point and
> therefore, similarly to mincore(), there is no guarantee that the page
> will be present at the time that the user application uses the data that
> resides on the page. Nevertheless, it is expected that in the vast
> majority of the cases this would not happen, since prefetch_page()
> accesses the page and therefore sets the PTE access-bit (if it is
> clear). 
> 
> The implementation is as follows. The vDSO code accesses the data,
> triggering a page-fault if it is not present. The handler detects, based
> on the instruction pointer and the recently introduced vDSO exception
> tables, that this is an asynchronous #PF. If the page can be brought in
> without waiting (e.g., the page is already in the page-cache), the
> kernel handles the fault and returns success (zero). If there is memory
> pressure that prevents proper handling of the fault (i.e., it requires
> heavy-weight reclamation), the kernel returns failure. Otherwise, it
> starts an I/O to bring in the page and returns failure.
> 
> Compilers can be extended to issue the prefetch_page() calls when
> needed.

Interesting, but given we've been removing explicit prefetch from some
parts of the kernel how useful is this in actual use? I'm thinking there
should at least be a real user and performance numbers with this before
merging.
Nadav Amit Feb. 25, 2021, 8:52 a.m. UTC | #2
> On Feb 25, 2021, at 12:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Just as applications can use prefetch instructions to overlap
>> computations and memory accesses, applications may want to overlap the
>> page-faults and compute or overlap the I/O accesses that are required
>> for page-faults of different pages.
>> 
>> Applications can use multiple threads and cores for this matter, by
>> running one thread that prefetches the data (i.e., faults in the data)
>> and another that does the compute, but this scheme is inefficient. Using
>> mincore() can tell whether a page is mapped, but might not tell whether
>> the page is in the page-cache and does not fault in the data.
>> 
>> Introduce a prefetch_page() vDSO call that prefetches, i.e. faults in,
>> memory asynchronously. The semantics of this call are: try to prefetch
>> the page at a given address and return zero if the page is accessible
>> following the call. Start I/O operations to retrieve the page if such
>> operations are required and there is no high memory pressure that might
>> introduce slowdowns.
>> 
>> Note that as usual the page might be paged-out at any point and
>> therefore, similarly to mincore(), there is no guarantee that the page
>> will be present at the time that the user application uses the data that
>> resides on the page. Nevertheless, it is expected that in the vast
>> majority of the cases this would not happen, since prefetch_page()
>> accesses the page and therefore sets the PTE access-bit (if it is
>> clear).
>> 
>> The implementation is as follows. The vDSO code accesses the data,
>> triggering a page-fault if it is not present. The handler detects, based
>> on the instruction pointer and the recently introduced vDSO exception
>> tables, that this is an asynchronous #PF. If the page can be brought in
>> without waiting (e.g., the page is already in the page-cache), the
>> kernel handles the fault and returns success (zero). If there is memory
>> pressure that prevents proper handling of the fault (i.e., it requires
>> heavy-weight reclamation), the kernel returns failure. Otherwise, it
>> starts an I/O to bring in the page and returns failure.
>> 
>> Compilers can be extended to issue the prefetch_page() calls when
>> needed.
> 
> Interesting, but given we've been removing explicit prefetch from some
> parts of the kernel how useful is this in actual use? I'm thinking there
> should at least be a real user and performance numbers with this before
> merging.

Can you give me a reference to the “removing explicit prefetch from some
parts of the kernel”?

I will work on an llvm/gcc plugin to provide some performance numbers.
I wanted to make sure that the idea is not a complete obscenity first.
Nadav Amit Feb. 25, 2021, 9:32 a.m. UTC | #3
> On Feb 25, 2021, at 12:52 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
> 
>> On Feb 25, 2021, at 12:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>> 
>>> Just as applications can use prefetch instructions to overlap
>>> computations and memory accesses, applications may want to overlap the
>>> page-faults and compute or overlap the I/O accesses that are required
>>> for page-faults of different pages.
[ snip ]

>> Interesting, but given we've been removing explicit prefetch from some
>> parts of the kernel how useful is this in actual use? I'm thinking there
>> should at least be a real user and performance numbers with this before
>> merging.
> 
> Can you give me a reference to the “removing explicit prefetch from some
> parts of the kernel”?

Oh. I get it - you mean we removed the use of explicit memory prefetch
from the kernel code. Well, I don’t think it is really related, but yes,
performance numbers are needed.
Peter Zijlstra Feb. 25, 2021, 9:55 a.m. UTC | #4
On Thu, Feb 25, 2021 at 01:32:56AM -0800, Nadav Amit wrote:
> > On Feb 25, 2021, at 12:52 AM, Nadav Amit <nadav.amit@gmail.com> wrote:

> > Can you give me a reference to the “removing explicit prefetch from some
> > parts of the kernel”?

75d65a425c01 ("hlist: remove software prefetching in hlist iterators")
e66eed651fd1 ("list: remove prefetching from regular list iterators")

> Oh. I get it - you mean we removed the use of explicit memory prefetch
> from the kernel code. Well, I don’t think it is really related, but yes,
> performance numbers are needed.

Right, so my main worry was that use of the prefetch instruction
actually hurt performance once the hardware prefetchers got smart
enough, this being a very similar construct (just on a different level
of the stack) should be careful not to suffer the same fate.
Matthew Wilcox (Oracle) Feb. 25, 2021, 12:16 p.m. UTC | #5
On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
> Just as applications can use prefetch instructions to overlap
> computations and memory accesses, applications may want to overlap the
> page-faults and compute or overlap the I/O accesses that are required
> for page-faults of different pages.

Isn't this madvise(MADV_WILLNEED)?
Nadav Amit Feb. 25, 2021, 4:56 p.m. UTC | #6
> On Feb 25, 2021, at 4:16 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>> Just as applications can use prefetch instructions to overlap
>> computations and memory accesses, applications may want to overlap the
>> page-faults and compute or overlap the I/O accesses that are required
>> for page-faults of different pages.
> 
> Isn't this madvise(MADV_WILLNEED)?

Good point that I should have mentioned. In a way prefetch_page() is a
combination of mincore() and MADV_WILLNEED.

There are 4 main differences from MADV_WILLNEED:

1. Much lower invocation cost if the readahead is not needed: this makes
it feasible to prefetch pages much more liberally.

2. Return value: the return value tells you whether the page is
accessible, which makes it usable for coroutines, for instance (see the
short sketch after this list). In this regard the call is more similar
to mincore() than to MADV_WILLNEED.

3. The PTEs are mapped if the pages are already present in the
swap/page-cache, preventing an additional page-fault just to map them.

4. Avoiding heavy-weight reclamation on low memory (this may need to
be selective, and can be integrated with MADV_WILLNEED).
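
The coroutine-style use of the return value mentioned in (2) could look
roughly like this (hypothetical names, only meant to illustrate the
model):

#include <stdbool.h>

/* Hypothetical prototype of the vDSO call. */
typedef long (*prefetch_page_fn)(const void *addr);

struct task {
	const char *next_addr;	/* next page this task will touch */
	/* ... rest of the coroutine state ... */
};

/*
 * Instead of blocking the whole thread in a major fault, a non-zero
 * return value lets a user-space scheduler park this task and run
 * another one while the kernel brings the page in.
 */
static bool task_ready(struct task *t, prefetch_page_fn prefetch_page)
{
	return prefetch_page(t->next_addr) == 0;
}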
Matthew Wilcox (Oracle) Feb. 25, 2021, 5:32 p.m. UTC | #7
On Thu, Feb 25, 2021 at 04:56:50PM +0000, Nadav Amit wrote:
> 
> > On Feb 25, 2021, at 4:16 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
> >> Just as applications can use prefetch instructions to overlap
> >> computations and memory accesses, applications may want to overlap the
> >> page-faults and compute or overlap the I/O accesses that are required
> >> for page-faults of different pages.
> > 
> > Isn't this madvise(MADV_WILLNEED)?
> 
> Good point that I should have mentioned. In a way prefetch_page() is a
> combination of mincore() and MADV_WILLNEED.
> 
> There are 4 main differences from MADV_WILLNEED:
> 
> 1. Much lower invocation cost if the readahead is not needed: this makes
> it feasible to prefetch pages much more liberally.

That seems like something that could be fixed in libc -- if we add a
page prefetch vdso call, an application calling posix_madvise() could
be implemented by calling this fast path.  Assuming the performance
increase justifies this extra complexity.
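
Something along these lines, purely hypothetical (the vDSO symbol name
and how libc resolves it are made up):

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed to be resolved from the vDSO at startup; may be NULL. */
extern long (*__vdso_prefetch_page)(const void *addr);

/*
 * Fast path that a posix_madvise(POSIX_MADV_WILLNEED) implementation
 * could take before falling back to the system call.
 */
static int willneed_fast(void *addr, size_t len)
{
	if (__vdso_prefetch_page) {
		size_t page = (size_t)sysconf(_SC_PAGESIZE);
		char *p = addr, *end = p + len;

		/* No system call: the vDSO call faults pages in as needed. */
		for (; p < end; p += page)
			__vdso_prefetch_page(p);
		return 0;
	}
	/* No vDSO helper available: use the plain system call. */
	return madvise(addr, len, MADV_WILLNEED) ? errno : 0;
}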

> 2. Return value: the return value tells you whether the page is
> accessible, which makes it usable for coroutines, for instance. In this
> regard the call is more similar to mincore() than to MADV_WILLNEED.

I don't quite understand the programming model you're describing here.

> 3. The PTEs are mapped if the pages are already present in the
> swap/page-cache, preventing an additional page-fault just to map them.

We could enhance madvise() to do this, no?

> 4. Avoiding heavy-weight reclamation on low memory (this may need to
> be selective, and can be integrated with MADV_WILLNEED).

Likewise.

I don't want to add a new Linux-specific call when there's already a
POSIX interface that communicates the exact same thing.  The return
value seems like the only problem.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html
Nadav Amit Feb. 25, 2021, 5:53 p.m. UTC | #8
> On Feb 25, 2021, at 9:32 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Thu, Feb 25, 2021 at 04:56:50PM +0000, Nadav Amit wrote:
>> 
>>> On Feb 25, 2021, at 4:16 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>> 
>>> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>>>> Just as applications can use prefetch instructions to overlap
>>>> computations and memory accesses, applications may want to overlap the
>>>> page-faults and compute or overlap the I/O accesses that are required
>>>> for page-faults of different pages.
>>> 
>>> Isn't this madvise(MADV_WILLNEED)?
>> 
>> Good point that I should have mentioned. In a way prefetch_page() is a
>> combination of mincore() and MADV_WILLNEED.
>> 
>> There are 4 main differences from MADV_WILLNEED:
>> 
>> 1. Much lower invocation cost if the readahead is not needed: this makes
>> it feasible to prefetch pages much more liberally.
> 
> That seems like something that could be fixed in libc -- if we add a
> page prefetch vdso call, an application calling posix_madvise() could
> be implemented by calling this fast path.  Assuming the performance
> increase justifies this extra complexity.
> 
>> 2. Return value: the return value tells you whether the page is
>> accessible, which makes it usable for coroutines, for instance. In this
>> regard the call is more similar to mincore() than to MADV_WILLNEED.
> 
> I don't quite understand the programming model you're describing here.
> 
>> 3. The PTEs are mapped if the pages are already present in the
>> swap/page-cache, preventing an additional page-fault just to map them.
> 
> We could enhance madvise() to do this, no?
> 
>> 4. Avoiding heavy-weight reclamation on low memory (this may need to
>> be selective, and can be integrated with MADV_WILLNEED).
> 
> Likewise.
> 
> I don't want to add a new Linux-specific call when there's already a
> POSIX interface that communicates the exact same thing.  The return
> value seems like the only problem.

I agree that this call does not have to be exposed to the application.

I am not sure there is a lot of extra complexity now, but obviously
some evaluations are needed.