Message ID | 20241210160556.2341497-3-arnd@kernel.org (mailing list archive) |
---|---|
State | New |
Series | ARM: towards 32-bit preempt-rt support |
On Tue, Dec 10, 2024 at 5:06 PM Arnd Bergmann <arnd@kernel.org> wrote:

> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> gup_pgd_range() is invoked with disabled interrupts and invokes

There is no gup_pgd_range() in the kernel, is this patch a bit old?
There is gup_fast_pgd_range(). See
23babe1934d7637b598e4c9d9f3876e318fa63a4

gup.c contains:

  get_user_pages_fast attempts to pin user pages by walking the page
  tables directly and avoids taking locks.

(...)

  Let's consistently call the "fast-only" part of GUP "GUP-fast" and
  rename all relevant internal functions to start with "gup_fast", to
  make it clearer that this is not ordinary GUP. The current mixture of
  "lockless", "gup" and "gup_fast" is confusing.

So fast GUP is supposed to be lockless, and should just not
have this problem. So it can't be addressing gup_fast_pgd_range()
right?

> __kmap_local_page_prot() via pte_offset_map(), gup_p4d_range().
> With HIGHPTE enabled, __kmap_local_page_prot() invokes kmap_high_get()
> which uses a spinlock_t via lock_kmap_any(). This leads to an
> sleeping-while-atomic error on PREEMPT_RT because spinlock_t becomes a
> sleeping lock and must not be acquired in atomic context.

I think this needs to be inspected by David Hildenbrand: if he
consistently renamed the GUP functions to be "fast" and there is a lock
somewhere deep in there, something must be wrong and violating the API
contract.

> The loop in map_new_virtual() uses wait_queue_head_t for wake up which
> also is using a spinlock_t.
>
> Since HIGHPTE is rarely needed at all, turn it off for PREEMPT_RT
> to allow the use of get_user_pages_fast().
>
> [arnd: rework patch to turn off HIGHPTE instead of HAVE_PAST_GUP]

HAVE_FAST_GUP

I'm still confused, how can something that is supposed to be lockless
"fast" acquire a spinlock? Something is odd here.

> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> There is an open question about whether HIGHPTE is still needed
> at all, given how rare 32-bit machines with more than 4GB
> are on any architecture. If we instead decide to remove HIGHPTE
> altogether, this patch is no longer needed.

I'm more asking if HIGHPTE even acquires a spinlock anymore, as it is
supposed to be "fast"/lockless. If it does, it is clearly violating
the "fast" promise of the fast GUP API and should not exist.

Yours,
Linus Walleij
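For context, the call chain being discussed looks roughly like this
after the gup_fast renaming (a simplified trace, not verbatim kernel
source; the real code lives in mm/gup.c and mm/highmem.c):

gup_fast()                                   mm/gup.c
  local_irq_save()                           /* atomic from here on */
  gup_fast_pgd_range()
    gup_fast_p4d_range()
      gup_fast_pud_range()
        gup_fast_pmd_range()
          gup_fast_pte_range()
            pte_offset_map()                 /* with CONFIG_HIGHPTE the PTE
                                                page may sit in highmem and
                                                has to be mapped first */
              kmap_local_page()
                __kmap_local_page_prot()
                  arch_kmap_local_high_get() /* arm only */
                    kmap_high_get()
                      lock_kmap_any()        /* spinlock_t, sleeps on RT */

So GUP-fast itself takes no locks; the spinlock only shows up on arm32
with HIGHPTE, via the kmap path needed to reach a PTE page that lives
in highmem.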
On 2024-12-10 17:05:54 [+0100], Arnd Bergmann wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> gup_pgd_range() is invoked with disabled interrupts and invokes
> __kmap_local_page_prot() via pte_offset_map(), gup_p4d_range().

s@gup_pgd_range@gup_fast_pgd_range@
s@gup_p4d_range@gup_fast_p4d_range@

The functions got renamed…

> With HIGHPTE enabled, __kmap_local_page_prot() invokes kmap_high_get()
> which uses a spinlock_t via lock_kmap_any(). This leads to an
> sleeping-while-atomic error on PREEMPT_RT because spinlock_t becomes a
> sleeping lock and must not be acquired in atomic context.
>
> The loop in map_new_virtual() uses wait_queue_head_t for wake up which
> also is using a spinlock_t.
>
> Since HIGHPTE is rarely needed at all, turn it off for PREEMPT_RT
> to allow the use of get_user_pages_fast().
>
> [arnd: rework patch to turn off HIGHPTE instead of HAVE_PAST_GUP]
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

This version works, too. Thanks.

> ---
> There is an open question about whether HIGHPTE is still needed
> at all, given how rare 32-bit machines with more than 4GB
> are on any architecture. If we instead decide to remove HIGHPTE
> altogether, this patch is no longer needed.

HIGHPTE isn't so much about 4GiB+ of RAM but about the page tables,
which get offloaded to HIGHMEM. Maybe it is more likely to be needed
with 4GiB+ of memory. No idea. x86 had support for up to 64GiB of
memory and is the only other architecture supporting HIGHPTE :)

I guess if you have boxes with 4GiB+ and can prove that the performance
improves without HIGHPTE (since you don't have to map the page table).
The question is then how much of low mem has to be used instead and
when does it start to hurt.

Sebastian
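For reference, what HIGHPTE actually changes is small: user PTE pages
are simply allocated with __GFP_HIGHMEM, so they may land in highmem
and need a kmap on every later access. On arm this is roughly the
following (simplified from arch/arm/include/asm/pgalloc.h; check the
tree for the exact current form):

#ifdef CONFIG_HIGHPTE
#define PGTABLE_HIGHMEM	__GFP_HIGHMEM
#else
#define PGTABLE_HIGHMEM	0
#endif

static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
{
	struct page *pte;

	/* With HIGHPTE the PTE page may come from highmem, so later
	 * accesses (fault handling, fork, GUP, ...) need a kmap.  */
	pte = __pte_alloc_one(mm, GFP_PGTABLE_USER | PGTABLE_HIGHMEM);
	if (!pte)
		return NULL;
	if (!PageHighMem(pte))
		clean_pte_table(page_address(pte));
	return pte;
}

That is the trade-off under discussion: lowmem saved for page tables
versus the cost (and, on PREEMPT_RT, the locking trouble) of mapping
those pages on demand.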
On 2024-12-11 14:48:11 [+0100], To Arnd Bergmann wrote:
> I guess if you have boxes with 4GiB+ and can prove that the performance
> improves without HIGHPTE (since you don't have to map the page table).
> The question is then how much of low mem has to be used instead and when
> does it start to hurt.

Some numbers have been documented in commit 14315592009c1 ("x86, mm:
Allow highmem user page tables to be disabled at boot time") and I
would like to cite:

| We could probably handwave up an argument for a threshold at 16G of total
| RAM.

which means HIGHPTE would make sense with >= 16GiB of memory.

Sebastian
On Wed, Dec 11, 2024, at 15:04, Sebastian Andrzej Siewior wrote:
> On 2024-12-11 14:48:11 [+0100], To Arnd Bergmann wrote:
>> I guess if you have boxes with 4GiB+ and can prove that the performance
>> improves without HIGHPTE (since you don't have to map the page table).
>> The question is then how much of low mem has to be used instead and when
>> does it start to hurt.
>
> Some numbers have been documented in commit
> 14315592009c1 ("x86, mm: Allow highmem user page tables to be
> disabled at boot time")
>
> and I would like to cite:
> | We could probably handwave up an argument for a threshold at 16G of total
> | RAM.
>
> which means HIGHPTE would make sense with >= 16GiB of memory.

Very useful, thanks! On x86, that means we can definitely remove
HIGHPTE along with CONFIG_HIGHMEM64G.

On 32-bit ARM, we still need to support LPAE for systems that require
64-bit addressing. LPAE supports 36 bits of addressing (up to 64GB),
but the largest actual size I've seen mentioned is 16GB (Hisilicon
HiP04, Calxeda Midway servers), and I'm certain nobody actually
requires these to perform well given that they are no longer useful
for the workloads they were designed for.

There are also a small number of embedded systems with 8GB (TI
Keystone2, NVidia Tegra3, Marvell Armada XP), but they are rare enough
that turning off HIGHPTE is completely safe.

Arnd
On 2024-12-11 14:29:29 [+0100], Linus Walleij wrote:
> So fast GUP is supposed to be lockless, and should just not
> have this problem. So it can't be addressing gup_fast_pgd_range()
> right?
…
> I'm more asking if HIGHPTE even acquires a spinlock anymore
> as it is supposed to be "fast"/lockless. If it does, it is clearly violating
> the "fast" promise of the fast GUP API and should not exist.

This is lockless on x86. The problem is ARM's
arch_kmap_local_high_get(). This is where the lock is from.

> Yours,
> Linus Walleij

Sebastian
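For reference, this is approximately where that lock lives (trimmed
from mm/highmem.c; arm defines ARCH_NEEDS_KMAP_HIGH_GET and wires
arch_kmap_local_high_get() up to kmap_high_get()):

static DEFINE_SPINLOCK(kmap_lock);

/* only used when ARCH_NEEDS_KMAP_HIGH_GET is defined, i.e. on arm */
#define lock_kmap_any(flags)	spin_lock_irqsave(&kmap_lock, flags)
#define unlock_kmap_any(flags)	spin_unlock_irqrestore(&kmap_lock, flags)

void *kmap_high_get(struct page *page)
{
	unsigned long vaddr, flags;

	lock_kmap_any(flags);	/* spinlock_t: a sleeping lock on PREEMPT_RT */
	vaddr = (unsigned long)page_address(page);
	if (vaddr) {
		BUG_ON(pkmap_count[PKMAP_NR(vaddr)] < 1);
		pkmap_count[PKMAP_NR(vaddr)]++;
	}
	unlock_kmap_any(flags);
	return (void *)vaddr;
}

x86 never defines ARCH_NEEDS_KMAP_HIGH_GET, so its GUP-fast path stays
lockless even with HIGHPTE.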
On Wed, Dec 11, 2024 at 03:04:02PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-12-11 14:48:11 [+0100], To Arnd Bergmann wrote:
> > I guess if you have boxes with 4GiB+ and can prove that the performance
> > improves without HIGHPTE (since you don't have to map the page table).
> > The question is then how much of low mem has to be used instead and when
> > does it start to hurt.
>
> Some numbers have been documented in commit
> 14315592009c1 ("x86, mm: Allow highmem user page tables to be disabled at boot time")
>
> and I would like to cite:
> | We could probably handwave up an argument for a threshold at 16G of total
> | RAM.
>
> which means HIGHPTE would make sense with >= 16GiB of memory.

However, there is more to consider.

32-bit Arm works out the same here: assuming 768M of lowmem, we have
196608 potential lowmem PTE pages, and each of those pages can map 2M
of RAM in a PAE-enabled configuration, meaning a maximum of 384G of RAM
could potentially be mapped using lowmem PTEs. That is because,
presumably, x86 uses 8 bytes per PTE entry, whereas on Arm we still use
4 bytes, but since we keep two copies of each PTE (one for hardware,
the other for the kernel) it works out that we're the same there - one
PTE page can also map 2M of RAM.

However, what is quite different is the L1 page tables. On x86,
everything is nice and easy, and each page table is one 4k page. On
32-bit Arm, this is not the case - we need to grab a 16k page for the
L1, and the more immovable allocations we have in lowmem, the harder it
will be to satisfy this. Failing to grab a 16k page leads to fork()
failing and an unusable system.

So, we want to keep as many immovable allocations out of lowmem as
possible - which is an additional constraint x86 doesn't have, and it
shouldn't be overlooked without ensuring that the probability of that
failure remains acceptably low.
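The arithmetic behind those figures, as a quick sanity check (a
standalone userspace snippet, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long long lowmem = 768ULL << 20;	/* 768M of lowmem      */
	unsigned long long page = 4096;			/* 4k pages            */
	unsigned long long pte_pages = lowmem / page;	/* potential PTE pages */

	/*
	 * One PTE page maps 2M either way: LPAE/PAE uses 512 8-byte
	 * entries, classic 32-bit Arm uses 4-byte entries but stores each
	 * PTE twice (hardware + Linux copy), so 512 usable entries too.
	 */
	unsigned long long mapped_per_pte_page = 512 * page;	/* 2M */

	printf("%llu PTE pages -> %lluG mappable from lowmem\n",
	       pte_pages, (pte_pages * mapped_per_pte_page) >> 30);
	return 0;
}

This prints "196608 PTE pages -> 384G mappable from lowmem", matching
the numbers above.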
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ed850cc0ed3c..4de4e5697bdf 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1231,7 +1231,7 @@ config HIGHMEM
 
 config HIGHPTE
 	bool "Allocate 2nd-level pagetables from highmem" if EXPERT
-	depends on HIGHMEM
+	depends on HIGHMEM && !PREEMPT_RT
 	default y
 	help
 	  The VM uses one page of physical memory for each page table.