diff mbox series

[2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels

Message ID 20241210160556.2341497-3-arnd@kernel.org (mailing list archive)
State New
Headers show
Series ARM: towards 32-bit preempt-rt support | expand

Commit Message

Arnd Bergmann Dec. 10, 2024, 4:05 p.m. UTC
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

gup_pgd_range() is invoked with disabled interrupts and invokes
__kmap_local_page_prot() via pte_offset_map(), gup_p4d_range().
With HIGHPTE enabled, __kmap_local_page_prot() invokes kmap_high_get()
which uses a spinlock_t via lock_kmap_any(). This leads to a
sleeping-while-atomic error on PREEMPT_RT because spinlock_t becomes a
sleeping lock and must not be acquired in atomic context.

The loop in map_new_virtual() uses wait_queue_head_t for wake up which
also is using a spinlock_t.

Since HIGHPTE is rarely needed at all, turn it off for PREEMPT_RT
to allow the use of get_user_pages_fast().

[arnd: rework patch to turn off HIGHPTE instead of HAVE_PAST_GUP]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
There is an open question about whether HIGHPTE is still needed
at all, given how rare 32-bit machines with more than 4GB
are on any architecture. If we instead decide to remove HIGHPTE
altogether, this patch is no longer needed.
---
 arch/arm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Linus Walleij Dec. 11, 2024, 1:29 p.m. UTC | #1
On Tue, Dec 10, 2024 at 5:06 PM Arnd Bergmann <arnd@kernel.org> wrote:

> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> gup_pgd_range() is invoked with disabled interrupts and invokes

There is no gup_pgd_range() in the kernel; is this patch a bit
old?

There is gup_fast_pgd_range().
See commit 23babe1934d7637b598e4c9d9f3876e318fa63a4.
gup.c contains:

  get_user_pages_fast attempts to pin user pages by
  walking the page tables directly and avoids taking locks.
(...)
  Let's consistently call the "fast-only" part of GUP "GUP-fast"
  and rename all relevant internal functions to start with
  "gup_fast", to make it clearer that this is not ordinary GUP.
  The current mixture of "lockless", "gup" and "gup_fast" is
  confusing.

So fast GUP is supposed to be lockless, and should just not
have this problem. So it can't be addressing gup_fast_pgd_range()
right?

> __kmap_local_page_prot() via pte_offset_map(), gup_p4d_range().
> With HIGHPTE enabled, __kmap_local_page_prot() invokes kmap_high_get()
> which uses a spinlock_t via lock_kmap_any(). This leads to a
> sleeping-while-atomic error on PREEMPT_RT because spinlock_t becomes a
> sleeping lock and must not be acquired in atomic context.

I think this needs to be inspected by David Hildenbrand: if he consistently
renamed the GUP functions to be "fast" and there is a lock somewhere
deep in there, something must be wrong and violating the API
contract.

> The loop in map_new_virtual() uses wait_queue_head_t for wake up which
> also is using a spinlock_t.
>
> Since HIGHPTE is rarely needed at all, turn it off for PREEMPT_RT
> to allow the use of get_user_pages_fast().
>
> [arnd: rework patch to turn off HIGHPTE instead of HAVE_PAST_GUP]

HAVE_FAST_GUP

I'm still confused, how can something that is supposed to be
lockless "fast" acquire a spinlock? Something is odd here.

> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> There is an open question about whether HIGHPTE is still needed
> at all, given how rare 32-bit machines with more than 4GB
> are on any architecture. If we instead decide to remove HIGHPTE
> altogether, this patch is no longer needed.

I'm more asking if HIGHPTE even acquires a spinlock anymore
as it is supposed to be "fast"/lockless. If it does, it is clearly violating
the "fast" promise of the fast GUP API and should not exist.

Yours,
Linus Walleij
Sebastian Andrzej Siewior Dec. 11, 2024, 1:48 p.m. UTC | #2
On 2024-12-10 17:05:54 [+0100], Arnd Bergmann wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> gup_pgd_range() is invoked with disabled interrupts and invokes
> __kmap_local_page_prot() via pte_offset_map(), gup_p4d_range().

s@gup_pgd_range@gup_fast_pgd_range@
s@gup_p4d_range@gup_fast_p4d_range@

The functions got renamed…

> With HIGHPTE enabled, __kmap_local_page_prot() invokes kmap_high_get()
> which uses a spinlock_t via lock_kmap_any(). This leads to a
> sleeping-while-atomic error on PREEMPT_RT because spinlock_t becomes a
> sleeping lock and must not be acquired in atomic context.
> 
> The loop in map_new_virtual() uses wait_queue_head_t for wake up which
> also is using a spinlock_t.
> 
> Since HIGHPTE is rarely needed at all, turn it off for PREEMPT_RT
> to allow the use of get_user_pages_fast().
> 
> [arnd: rework patch to turn off HIGHPTE instead of HAVE_PAST_GUP]
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

This version works, too. Thanks.

> ---
> There is an open question about whether HIGHPTE is still needed
> at all, given how rare 32-bit machines with more than 4GB
> are on any architecture. If we instead decide to remove HIGHPTE
> altogether, this patch is no longer needed.

HIGHPTE isn't so much about 4GiB+ as about the page tables, which are
offloaded to HIGHMEM. Maybe it is more likely to be needed with 4GiB+ of
memory. No idea. x86 had support for up to 64GiB of memory and is the
only other architecture supporting HIGHPTE :)

I guess if you have boxes with 4GiB+ and can prove that the performance
improves without HIGHPTE (since you don't have to map the page table).
The question is then how much of low mem has to be used instead and when
does it start to hurt.

Sebastian
Sebastian Andrzej Siewior Dec. 11, 2024, 2:04 p.m. UTC | #3
On 2024-12-11 14:48:11 [+0100], To Arnd Bergmann wrote:
> I guess if you have boxes with 4GiB+ and can prove that the performance
> improves without HIGHPTE (since you don't have to map the page table).
> The question is then how much of low mem has to be used instead and when
> does it start to hurt.

Some numbers have been documented in commit
   14315592009c1 ("x86, mm: Allow highmem user page tables to be disabled at boot time")

and I would like to cite:
| We could probably handwave up an argument for a threshold at 16G of total
| RAM.

which means HIGHPTE would make sense with >= 16GiB of memory.

Sebastian
Arnd Bergmann Dec. 11, 2024, 2:30 p.m. UTC | #4
On Wed, Dec 11, 2024, at 15:04, Sebastian Andrzej Siewior wrote:
> On 2024-12-11 14:48:11 [+0100], To Arnd Bergmann wrote:
>> I guess if you have boxes with 4GiB+ and can prove that the performance
>> improves without HIGHPTE (since you don't have to map the page table).
>> The question is then how much of low mem has to be used instead and when
>> does it start to hurt.
>
> Some numbers have been documented in commit
>    14315592009c1 ("x86, mm: Allow highmem user page tables to be 
> disabled at boot time")
>
> and I would like to cite:
> | We could probably handwave up an argument for a threshold at 16G of total
> | RAM.
>
> which means HIGHPTE would make sense with >= 16GiB of memory.

Very useful, thanks!

That means we can definitely remove HIGHPTE along with
CONFIG_HIGHMEM64G on x86.

On 32-bit ARM, we still need to support LPAE for systems that
require 64-bit addressing. LPAE supports 40 bits of physical addressing
(up to 1TB architecturally), but the largest actual size I've seen mentioned
is 16GB (Hisilicon HiP04, Calxeda Midway servers) and I'm
certain nobody actually requires these to perform well
given that they are no longer useful for the workloads they
were designed for.

There are also a small number of embedded systems with 8GB
(Ti Keystone2, NVidia Tegra3, Marvell Armada XP), but they
are rare enough that turning off HIGHPTE is completely safe.

      Arnd
Sebastian Andrzej Siewior Dec. 11, 2024, 3:22 p.m. UTC | #5
On 2024-12-11 14:29:29 [+0100], Linus Walleij wrote:
> So fast GUP is supposed to be lockless, and should just not
> have this problem. So it can't be addressing gup_fast_pgd_range()
> right?
…
> I'm more asking if HIGHPTE even acquires a spinlock anymore
> as it is supposed to be "fast"/lockless. If it does, it is clearly violating
> the "fast" promise of the fast GUP API and should not exist.

This is lockless on x86. The problem is ARM's
arch_kmap_local_high_get(). This is where the lock is from.

> Yours,
> Linus Walleij

Sebastian
Russell King (Oracle) Dec. 11, 2024, 3:55 p.m. UTC | #6
On Wed, Dec 11, 2024 at 03:04:02PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-12-11 14:48:11 [+0100], To Arnd Bergmann wrote:
> > I guess if you have boxes with 4GiB+ and can prove that the performance
> > improves without HIGHPTE (since you don't have to map the page table).
> > The question is then how much of low mem has to be used instead and when
> > does it start to hurt.
> 
> Some numbers have been documented in commit
>    14315592009c1 ("x86, mm: Allow highmem user page tables to be disabled at boot time")
> 
> and I would like to cite:
> | We could probably handwave up an argument for a threshold at 16G of total
> | RAM.
> 
> which means HIGHPTE would make sense with >= 16GiB of memory.

However, there is more to consider.

32-bit Arm works out the same for this:

    Assuming 768M of lowmem we have 196608 potential lowmem PTE
    pages. Each page can map 2M of RAM in a PAE-enabled configuration,
    meaning a maximum of 384G of RAM could potentially be mapped using
    lowmem PTEs.

because, presumably, x86 uses 8 bytes per PTE entry, whereas on Arm we
still use 4 bytes. However, because we keep two copies of each PTE (one
for hardware, the other for the kernel), it works out that we're the
same there - one PTE page can also map 2M of RAM.

However, what is quite different is the L1 page tables. On x86,
everything is nice and easy, and each page table is one 4k page.
On 32-bit Arm, this is not the case - we need to grab a 16k page for
the L1, and the more immovable allocations we have in lowmem, the
harder it will be to satisfy this. Failing to grab a 16k page
leads to fork() failing and an unusable system.

So, we want to keep as many immovable allocations out of lowmem as
possible - an additional constraint that x86 doesn't have, and one that
shouldn't be overlooked without ensuring that the probability of such an
allocation failure remains acceptably low.

Patch

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ed850cc0ed3c..4de4e5697bdf 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1231,7 +1231,7 @@  config HIGHMEM
 
 config HIGHPTE
 	bool "Allocate 2nd-level pagetables from highmem" if EXPERT
-	depends on HIGHMEM
+	depends on HIGHMEM && !PREEMPT_RT
 	default y
 	help
 	  The VM uses one page of physical memory for each page table.