Message ID | 20220415164413.2727220-1-song@kernel.org (mailing list archive)
Series | vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote: > Changes v3 => v4: > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig) > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in > kvm_s390_pv_alloc_vm. > > Changes v2 => v3: > 1. Use __vmalloc_huge in alloc_large_system_hash. > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig) > 3. Add more description about the issues and changes.(Christoph Hellwig, > Rick Edgecombe). > > Changes v1 => v2: > 1. Add vmalloc_huge(). (Christoph Hellwig) > 2. Add module_alloc_huge(). (Christoph Hellwig) > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis) > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has > caused some issues [1], as many users of vmalloc are not yet ready to > handle huge pages. To enable a more smooth transition to use huge page > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be > found at [2]. > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP. > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack. > > [1] https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/ > [2] https://lore.kernel.org/linux-mm/20220330225642.1163897-1-song@kernel.org/ Looks good except for that I think this should just wait for v5.19. The fixes are so large I can't see why this needs to be rushed in other than the first assumptions of the optimizations had some flaws addressed here. If this goes through v5.19 expect conflicts on modules unless you use modules-testing. Luis
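As a rough sketch of what the opt-in described in the cover letter looks like (reconstructed from the changelog, not copied from the patch itself, so the exact body in 1/4 may differ), huge-page backing now has to be requested explicitly:

```c
/*
 * Sketch of the opt-in: huge-page backing is only attempted when the caller
 * passes VM_ALLOW_HUGE_VMAP, instead of being the default that callers had
 * to turn off with VM_NO_HUGE_VMAP.
 */
void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
{
	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
				    NUMA_NO_NODE, __builtin_return_address(0));
}
EXPORT_SYMBOL_GPL(vmalloc_huge);
```

Callers that never opt in keep getting PAGE_SIZE ptes, which is what makes the change a candidate for backporting.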
On Fri, Apr 15, 2022 at 12:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote: > > Changes v3 => v4: > > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename > > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig) > > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in > > kvm_s390_pv_alloc_vm. > > > > Changes v2 => v3: > > 1. Use __vmalloc_huge in alloc_large_system_hash. > > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig) > > 3. Add more description about the issues and changes.(Christoph Hellwig, > > Rick Edgecombe). > > > > Changes v1 => v2: > > 1. Add vmalloc_huge(). (Christoph Hellwig) > > 2. Add module_alloc_huge(). (Christoph Hellwig) > > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis) > > > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has > > caused some issues [1], as many users of vmalloc are not yet ready to > > handle huge pages. To enable a more smooth transition to use huge page > > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new > > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be > > found at [2]. > > > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP. > > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack. > > > > [1] https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/ > > [2] https://lore.kernel.org/linux-mm/20220330225642.1163897-1-song@kernel.org/ > > Looks good except for that I think this should just wait for v5.19. The > fixes are so large I can't see why this needs to be rushed in other than > the first assumptions of the optimizations had some flaws addressed here. We need these changes to fix issues like [3]. Note that there might still be some undiscovered issues with huge page backed vmalloc memory on powerpc, which had HAVE_ARCH_HUGE_VMALLOC enabled since the 5.15 kernel. As we agreed, the new opt-in flag is a safer approach here. We probably should have 1/4 and 2/4 back ported to stable. Therefore, I think shipping this set now would give us a more reliable 5.18 release. Does this make sense? Thanks, Song [3] https://lore.kernel.org/lkml/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
On Fri, Apr 15, 2022 at 06:34:16PM -0700, Song Liu wrote: > On Fri, Apr 15, 2022 at 12:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote: > > > Changes v3 => v4: > > > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename > > > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig) > > > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in > > > kvm_s390_pv_alloc_vm. > > > > > > Changes v2 => v3: > > > 1. Use __vmalloc_huge in alloc_large_system_hash. > > > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig) > > > 3. Add more description about the issues and changes.(Christoph Hellwig, > > > Rick Edgecombe). > > > > > > Changes v1 => v2: > > > 1. Add vmalloc_huge(). (Christoph Hellwig) > > > 2. Add module_alloc_huge(). (Christoph Hellwig) > > > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis) > > > > > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has > > > caused some issues [1], as many users of vmalloc are not yet ready to > > > handle huge pages. To enable a more smooth transition to use huge page > > > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new > > > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be > > > found at [2]. > > > > > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP. > > > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack. > > > > > > [1] https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/ > > > [2] https://lore.kernel.org/linux-mm/20220330225642.1163897-1-song@kernel.org/ > > > > Looks good except for that I think this should just wait for v5.19. The > > fixes are so large I can't see why this needs to be rushed in other than > > the first assumptions of the optimizations had some flaws addressed here. > > We need these changes to fix issues like [3]. Note that there might > still be some > undiscovered issues with huge page backed vmalloc memory on powerpc, which > had HAVE_ARCH_HUGE_VMALLOC enabled since the 5.15 kernel. As we > agreed, the new opt-in flag is a safer approach here. We probably should have > 1/4 and 2/4 back ported to stable. Therefore, I think shipping this > set now would > give us a more reliable 5.18 release. > > Does this make sense? Yes absolutely, but that sounds like that optimization should just be reverted completely from v5.18 isntead then no? Or if one can skip the optimizations for v5.18 with a small patch, and then enable for v5.19. Luis
On Fri, Apr 15, 2022 at 06:42:27PM -0700, Luis Chamberlain wrote: > On Fri, Apr 15, 2022 at 06:34:16PM -0700, Song Liu wrote: > > On Fri, Apr 15, 2022 at 12:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote: > > > > Changes v3 => v4: > > > > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename > > > > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig) > > > > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in > > > > kvm_s390_pv_alloc_vm. > > > > > > > > Changes v2 => v3: > > > > 1. Use __vmalloc_huge in alloc_large_system_hash. > > > > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig) > > > > 3. Add more description about the issues and changes.(Christoph Hellwig, > > > > Rick Edgecombe). > > > > > > > > Changes v1 => v2: > > > > 1. Add vmalloc_huge(). (Christoph Hellwig) > > > > 2. Add module_alloc_huge(). (Christoph Hellwig) > > > > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis) > > > > > > > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has > > > > caused some issues [1], as many users of vmalloc are not yet ready to > > > > handle huge pages. To enable a more smooth transition to use huge page > > > > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new > > > > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be > > > > found at [2]. > > > > > > > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP. > > > > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack. > > > > > > > > [1] https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/ > > > > [2] https://lore.kernel.org/linux-mm/20220330225642.1163897-1-song@kernel.org/ > > > > > > Looks good except for that I think this should just wait for v5.19. The > > > fixes are so large I can't see why this needs to be rushed in other than > > > the first assumptions of the optimizations had some flaws addressed here. > > > > We need these changes to fix issues like [3]. Note that there might > > still be some > > undiscovered issues with huge page backed vmalloc memory on powerpc, which > > had HAVE_ARCH_HUGE_VMALLOC enabled since the 5.15 kernel. As we > > agreed, the new opt-in flag is a safer approach here. We probably should have > > 1/4 and 2/4 back ported to stable. Therefore, I think shipping this > > set now would > > give us a more reliable 5.18 release. > > > > Does this make sense? > > Yes absolutely, but that sounds like that optimization should just be > reverted completely from v5.18 isntead then no? Or if one can skip the > optimizations for v5.18 with a small patch, and then enable for v5.19. In any case it is up to Linus, I already chimed in with my opinion. Luis
On Fri, Apr 15, 2022 at 12:05:42PM -0700, Luis Chamberlain wrote: > Looks good except for that I think this should just wait for v5.19. The > fixes are so large I can't see why this needs to be rushed in other than > the first assumptions of the optimizations had some flaws addressed here. Patches 1 and 2 are bug fixes for regressions caused by using huge page backed vmalloc by default. So I think we do need it for 5.18. The other two do look like candidates for 5.19, though.
> On Apr 15, 2022, at 10:08 PM, Christoph Hellwig <hch@infradead.org> wrote: > > On Fri, Apr 15, 2022 at 12:05:42PM -0700, Luis Chamberlain wrote: >> Looks good except for that I think this should just wait for v5.19. The >> fixes are so large I can't see why this needs to be rushed in other than >> the first assumptions of the optimizations had some flaws addressed here. > > Patches 1 and 2 are bug fixes for regressions caused by using huge page > backed vmalloc by default. So I think we do need it for 5.18. The > other two do look like candidates for 5.19, though.

Thanks Luis and Christoph for your kind inputs on the set. Here is my analysis after thinking about it overnight.

We can discuss the users of vmalloc in 4 categories: module_alloc, BPF programs, alloc_large_system_hash, and others; and there are two archs involved here: x86_64 and powerpc. With the whole set, the behavior is:

                  |        x86_64        |        powerpc
    --------------+----------------------+----------------------
    module_alloc  |             use small pages
    --------------+----------------------+----------------------
    BPF programs  | use 2MB pages        | use small pages
    --------------+----------------------+----------------------
    large hash    |   use huge pages when size > PMD_SIZE
    --------------+----------------------+----------------------
    other-vmalloc |             use small pages

Patch 1/4 fixes the behavior of module_alloc and other-vmalloc. Without 1/4, both these users may get huge pages for size > PMD_SIZE allocations, which may be troublesome ([3], for example).

Patches 3/4 and 4/4, together with 1/4, allow BPF programs to use 2MB pages. This is the same behavior as before 5.18-rc1, which has been tested in bpf-next and linux-next. Therefore, I don't think we need to hold them until 5.19.

Patch 2/4 enables huge pages for large hash. Large hash has been using huge pages on powerpc since 5.15, but this is new for x86_64. If we ship 2/4, this is a performance improvement for x86_64, but it is less tested on x86_64 (it didn't go through linux-next). If we ship 1/4 but not 2/4 with 5.18, we will see a small performance regression for powerpc.

Based on this analysis, I think we should either 1) ship the whole set with 5.18; or 2) ship 1/4, 3/4, and 4/4 with 5.18, and 2/4 with 5.19. With option 1), we enable huge pages for large hash on x86_64 without going through linux-next. With option 2), we take a small performance regression with 5.18 on powerpc. Of course, we can ship a hybrid solution by gating 2/4 for powerpc only in 5.18, and enabling it for x86_64 in 5.19.

Does this make sense? Please let me know your comments and suggestions on this.

Thanks, Song

[3] https://lore.kernel.org/lkml/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
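For reference, the "large hash" row corresponds to the hashdist path in alloc_large_system_hash(); patch 2/4 presumably amounts to something like the following (a sketch based on the changelog, with the surrounding context recalled from memory and possibly differing slightly):

```c
	} else if (get_order(size) >= MAX_ORDER || hashdist) {
		/*
		 * Boot-time hash tables (dentry/inode caches and friends) can
		 * be many megabytes, so they are an easy win for huge pages.
		 */
		table = vmalloc_huge(size, gfp_flags);
		virt = true;
		if (table)
			huge = is_vm_area_hugepages(table);
	}
```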
On Sat, Apr 16, 2022 at 12:55 PM Song Liu <songliubraving@fb.com> wrote: > > Based on this analysis, I think we should either > 1) ship the whole set with 5.18; or > 2) ship 1/4, 3/4, and 4/4 with 5.18, and 2/4 with 5.19. Honestly, I think the proper thing to do is - apply #1, because yes, that "use huge pages" should be an opt-in. - but just disable hugepages for now. I think those games with set_memory_nx() and friends just show how rough this all is right now. In fact, I personally think that the whole bpf 'prog_pack' stuff should probably be disabled. It looks incredibly broken to me right now. Unless I mis-read it, it does a "module_alloc()" to allocate the vmap area, and then just marks it executable without having even initialized the pages. Am I missing something? So now we have random kernel memory that is marked executable. Sure, it's also marked RO, but who cares? It's random data that is now executable. Maybe I am missing something, but I really don't think this is ready for prime-time. We should effectively disable it all, and have people think through it a lot more. Linus
Hi Linus, Thanks a lot for your kind reply. > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Apr 16, 2022 at 12:55 PM Song Liu <songliubraving@fb.com> wrote: >> >> Based on this analysis, I think we should either >> 1) ship the whole set with 5.18; or >> 2) ship 1/4, 3/4, and 4/4 with 5.18, and 2/4 with 5.19. > > Honestly, I think the proper thing to do is > > - apply #1, because yes, that "use huge pages" should be an opt-in. > > - but just disable hugepages for now. Hmm.. This sure is an option... > > I think those games with set_memory_nx() and friends just show how > rough this all is right now. > > In fact, I personally think that the whole bpf 'prog_pack' stuff > should probably be disabled. It looks incredibly broken to me right > now. > > Unless I mis-read it, it does a "module_alloc()" to allocate the vmap > area, and then just marks it executable without having even > initialized the pages. Am I missing something? So now we have random > kernel memory that is marked executable. > > Sure, it's also marked RO, but who cares? It's random data that is now > executable. While this is a serious issue (my fault), it is relatively easy to fix. We can fill the whole space with invalid instructions when the page is allocated and when a BPF program is freed. Both these operations are on the slow paths, so the overhead is minimal. > > Maybe I am missing something, but I really don't think this is ready > for prime-time. We should effectively disable it all, and have people > think through it a lot more. This has been discussed on lwn.net: https://lwn.net/Articles/883454/. AFAICT, the biggest concern is whether reserving a minimum of 2MB for BPF programs is a good trade-off for memory usage. This is again my fault for not stating the motivation clearly: the primary gain comes from less page table fragmentation and thus better iTLB efficiency. Other folks (in the recent thread on this topic and offline in other discussions) also showed strong interest in using a similar technique for the text of kernel modules. So I would really like to learn your opinion on this. There are many details we can optimize, but I guess the general mechanism has to be something like: - allocate a huge page, make it safe, and set it as executable; - as users (BPF, kernel module, etc.) request memory for text, give a chunk of the huge page to the user. - use some mechanism to update the chunk of memory safely. As you correctly noted, I missed the "make it safe" step (which should be easy to fix). But the rest of the flow is more or less the best solution to me. Did I totally miss some better option here? Or is this really not a viable option (IOW, bpf programs and kernel modules have to use regular pages)? Thanks again, Song
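The alloc-time half of what Song proposes could be as simple as the sketch below, assuming the usual x86 convention of padding executable memory with INT3 so stray execution traps; the free-time half would have to go through the text-poking machinery once the region is already read-only (a sketch of that appears later in the thread). The function name here is illustrative, not taken from the patch:

```c
/*
 * Sketch only: fill a freshly allocated region with INT3 (0xcc) before it is
 * ever mapped executable, so unused space traps instead of running whatever
 * the page allocator handed back. Other architectures would use their own
 * trap/illegal instruction pattern.
 */
static void fill_with_ill_insns(void *area, unsigned int size)
{
	memset(area, 0xcc, size);
}
```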
Hi, On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote: > > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > Maybe I am missing something, but I really don't think this is ready > > for prime-time. We should effectively disable it all, and have people > > think through it a lot more. > > This has been discussed on lwn.net: https://lwn.net/Articles/883454/. > AFAICT, the biggest concern is whether reserving minimal 2MB for BPF > programs is a good trade-off for memory usage. This is again my fault > not to state the motivation clearly: the primary gain comes from less > page table fragmentation and thus better iTLB efficiency. Reserving 2MB pages for BPF programs will indeed reduce the fragmentation, but OTOH it will reduce memory utilization. If for large systems this may not be an issue, on smaller machines trading off memory for iTLB performance may be not that obvious. > Other folks (in recent thread on this topic and offline in other > discussions) also showed strong interests in using similar technical > for text of kernel modules. So I would really like to learn your > opinion on this. There are many details we can optimize, but I guess > the general mechanism has to be something like: > - allocate a huge page, make it safe, and set it as executable; > - as users (BPF, kernel module, etc.) request memory for text, give > a chunk of the huge page to the user. > - use some mechanism to update the chunk of memory safely. There are use-cases that require 4K pages with non-default permissions in the direct map and the pages not necessarily should be executable. There were several suggestions to implement caches of 4K pages backed by 2M pages. I believe that "allocate huge page and split it to basic pages to hand out to users" concept should be implemented at page allocator level and I posted and RFC for this a while ago: https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/
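To make the "cache of 4K pages backed by 2M pages" idea concrete, here is a purely hypothetical sketch of the bookkeeping such a cache implies; none of these names (huge_backed_cache, cache_alloc_4k) come from Mike's RFC, which instead does the work inside the page allocator:

```c
#include <linux/bitmap.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Hypothetical illustration only; not the API from the linked RFC. */
struct huge_backed_cache {
	spinlock_t	lock;
	void		*base;	/* one 2 MiB chunk; direct map adjusted once */
	DECLARE_BITMAP(used, PMD_SIZE / PAGE_SIZE);
};

static void *cache_alloc_4k(struct huge_backed_cache *c)
{
	unsigned long slot;
	void *p = NULL;

	spin_lock(&c->lock);
	slot = find_first_zero_bit(c->used, PMD_SIZE / PAGE_SIZE);
	if (slot < PMD_SIZE / PAGE_SIZE) {
		__set_bit(slot, c->used);
		p = c->base + slot * PAGE_SIZE;
	}
	spin_unlock(&c->lock);
	return p;	/* NULL: the cache needs a fresh 2 MiB backing chunk */
}
```

The point of pushing this into the page allocator, as the RFC does, is that the expensive direct-map permission change is paid once per 2 MiB chunk rather than once per 4 KiB allocation.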
On Mon, Apr 18, 2022 at 01:06:36PM +0300, Mike Rapoport wrote: > Hi, > > On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote: > > > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > Maybe I am missing something, but I really don't think this is ready > > > for prime-time. We should effectively disable it all, and have people > > > think through it a lot more. > > > > This has been discussed on lwn.net: https://lwn.net/Articles/883454/. > > AFAICT, the biggest concern is whether reserving minimal 2MB for BPF > > programs is a good trade-off for memory usage. This is again my fault > > not to state the motivation clearly: the primary gain comes from less > > page table fragmentation and thus better iTLB efficiency. > > Reserving 2MB pages for BPF programs will indeed reduce the fragmentation, > but OTOH it will reduce memory utilization. If for large systems this may > not be an issue, on smaller machines trading off memory for iTLB > performance may be not that obvious. So the current optimization at best should be a kconfig option? > > Other folks (in recent thread on this topic and offline in other > > discussions) also showed strong interests in using similar technical > > for text of kernel modules. So I would really like to learn your > > opinion on this. There are many details we can optimize, but I guess > > the general mechanism has to be something like: > > - allocate a huge page, make it safe, and set it as executable; > > - as users (BPF, kernel module, etc.) request memory for text, give > > a chunk of the huge page to the user. > > - use some mechanism to update the chunk of memory safely. > > There are use-cases that require 4K pages with non-default permissions in > the direct map and the pages not necessarily should be executable. There > were several suggestions to implement caches of 4K pages backed by 2M > pages. Even if we just focus on the executable side of the story... there may be users who can share this too. I've gone down memory lane now at least down to year 2005 in kprobes to see why the heck module_alloc() was used. At first glance there are some old comments about being within the 2 GiB text kernel range... But some old tribal knowledge is still lost. The real hints come from kprobe work since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of line - take2"), so that the "For the %rip-relative displacement fixups to be doable"... but this got me wondering, would other users who *do* want similar funcionality benefit from a cache. If the space is limited then using a cache makes sense. Specially if architectures tend to require hacks for some of this to all work. Then, since it seems since the vmalloc area was not initialized, wouldn't that break the old JIT spray fixes, refer to commit 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying attacks")? Is that sort of work not needed anymore? If in doubt I at least made the old proof of concept JIT spray stuff compile on recent kernels [0], but I haven't tried out your patches yet. If this is not needed anymore, why not? The collection of tribal knowedge around these sorts of things would be good to not loose and if we can share, even better. 
> I believe that "allocate huge page and split it to basic pages to hand out > to users" concept should be implemented at page allocator level and I > posted and RFC for this a while ago: > > https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/ Neat, so although eBPF is a big user, are there some use cases outside that immediately benefit? [0] https://github.com/mcgrof/jit-spray-poc-for-ksp LUis
On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote: > > There are use-cases that require 4K pages with non-default > > permissions in > > the direct map and the pages not necessarily should be executable. > > There > > were several suggestions to implement caches of 4K pages backed by > > 2M > > pages. > > Even if we just focus on the executable side of the story... there > may > be users who can share this too. > > I've gone down memory lane now at least down to year 2005 in kprobes > to see why the heck module_alloc() was used. At first glance there > are > some old comments about being within the 2 GiB text kernel range... > But > some old tribal knowledge is still lost. The real hints come from > kprobe work > since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of > line > - take2"), so that the "For the %rip-relative displacement fixups to > be > doable"... but this got me wondering, would other users who *do* want > similar funcionality benefit from a cache. If the space is limited > then > using a cache makes sense. Specially if architectures tend to require > hacks for some of this to all work. Yea, that was my understanding. X86 modules have to be linked within 2GB of the kernel text, also eBPF x86 JIT generates code that expects to be within 2GB of the kernel text. I think of two types of caches we could have: caches of unmapped pages on the direct map and caches of virtual memory mappings. Caches of pages on the direct map reduce breakage of the large pages (and is somewhat x86 specific problem). Caches of virtual memory mappings reduce shootdowns, and are also required to share huge pages. I'll plug my old RFC, where I tried to work towards enabling both: https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ Since then Mike has taken a lot further the direct map cache piece. Yea, probably a lot of JIT's are way smaller than a page, but there is also hopefully some performance benefit of reduced ITLB pressure and TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps its own cache of a page for putting very small trampolines. > > Then, since it seems since the vmalloc area was not initialized, > wouldn't that break the old JIT spray fixes, refer to commit > 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying > attacks")? Hmm, yea it might be a way to get around the ebpf jit rlimit. The allocator could just text_poke() invalid instructions on "free" of the jit. > > Is that sort of work not needed anymore? If in doubt I at least made > the > old proof of concept JIT spray stuff compile on recent kernels [0], > but > I haven't tried out your patches yet. If this is not needed anymore, > why not? IIRC this got addressed in two ways, randomizing of the jit offset inside the vmalloc allocation, and "constant blinding", such that the specific attack of inserting unaligned instructions as immediate instruction data did not work. Neither of those mitigations seem unworkable with a large page caching allocator. > > The collection of tribal knowedge around these sorts of things would > be > good to not loose and if we can share, even better. Totally agree here. I think the abstraction I was exploring in that RFC could remove some of the special permission memory tribal knowledge that is lurking in in the cross-arch module.c. I wonder if you have any thoughts on something like that? The normal modules proved the hardest.
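Rick's "text_poke() invalid instructions on free" idea could look roughly like this on x86, complementing the alloc-time fill sketched earlier (invalidate_jit_region is a made-up name; text_poke_copy() is the existing helper that writes through a temporary mapping, so the shared region never has to become writable):

```c
#include <linux/minmax.h>
#include <linux/string.h>
#include <linux/types.h>
#include <asm/text-patching.h>

/*
 * Sketch: stamp INT3 over a JIT image being freed from a shared RO+X region,
 * without ever flipping the region's permissions.
 */
static void invalidate_jit_region(void *addr, size_t len)
{
	u8 ill[64];

	memset(ill, 0xcc, sizeof(ill));
	while (len) {
		size_t chunk = min(len, sizeof(ill));

		text_poke_copy(addr, ill, chunk);
		addr += chunk;
		len -= chunk;
	}
}
```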
Hi Mike, Luis, and Rick, Thanks for sharing your work and findings in the space. I didn't realize we were looking at the same set of problems. > On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote: >>> There are use-cases that require 4K pages with non-default >>> permissions in >>> the direct map and the pages not necessarily should be executable. >>> There >>> were several suggestions to implement caches of 4K pages backed by >>> 2M >>> pages. >> >> Even if we just focus on the executable side of the story... there >> may >> be users who can share this too. >> >> I've gone down memory lane now at least down to year 2005 in kprobes >> to see why the heck module_alloc() was used. At first glance there >> are >> some old comments about being within the 2 GiB text kernel range... >> But >> some old tribal knowledge is still lost. The real hints come from >> kprobe work >> since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of >> line >> - take2"), so that the "For the %rip-relative displacement fixups to >> be >> doable"... but this got me wondering, would other users who *do* want >> similar funcionality benefit from a cache. If the space is limited >> then >> using a cache makes sense. Specially if architectures tend to require >> hacks for some of this to all work. > > Yea, that was my understanding. X86 modules have to be linked within > 2GB of the kernel text, also eBPF x86 JIT generates code that expects > to be within 2GB of the kernel text. > > > I think of two types of caches we could have: caches of unmapped pages > on the direct map and caches of virtual memory mappings. Caches of > pages on the direct map reduce breakage of the large pages (and is > somewhat x86 specific problem). Caches of virtual memory mappings > reduce shootdowns, and are also required to share huge pages. I'll plug > my old RFC, where I tried to work towards enabling both: > > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ > > Since then Mike has taken a lot further the direct map cache piece. These are really interesting work. With this landed, we won't need the bpf_prog_pack work at all (I think). OTOH, this looks like a long term project, as some of the work in bpf_prog_pack took quite some time to discuss/debate, and that was just a subset of the whole thing. I really like the two types of cache concept. But there are some details I cannot figure out about them: 1. Is "caches of unmapped pages on direct map" (cache #1) sufficient to fix all direct map fragmentation? IIUC, pages in the cache may still be used by other allocation (with some memory pressure). If the system runs for long enough, there may be a lot of direct map fragmentation. Is this right? 2. If we have "cache of virtual memory mappings" (cache #2), do we still need cache #1? I know cache #2 alone may waste some memory, but I still think 2MB within noise for modern systems. 3. If we do need both caches, what would be the right APIs? Thanks, Song > Yea, probably a lot of JIT's are way smaller than a page, but there is > also hopefully some performance benefit of reduced ITLB pressure and > TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps > its own cache of a page for putting very small trampolines. 
> >> >> Then, since it seems since the vmalloc area was not initialized, >> wouldn't that break the old JIT spray fixes, refer to commit >> 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying >> attacks")? > > Hmm, yea it might be a way to get around the ebpf jit rlimit. The > allocator could just text_poke() invalid instructions on "free" of the > jit. > >> >> Is that sort of work not needed anymore? If in doubt I at least made >> the >> old proof of concept JIT spray stuff compile on recent kernels [0], >> but >> I haven't tried out your patches yet. If this is not needed anymore, >> why not? > > IIRC this got addressed in two ways, randomizing of the jit offset > inside the vmalloc allocation, and "constant blinding", such that the > specific attack of inserting unaligned instructions as immediate > instruction data did not work. Neither of those mitigations seem > unworkable with a large page caching allocator. > >> >> The collection of tribal knowedge around these sorts of things would >> be >> good to not loose and if we can share, even better. > > Totally agree here. I think the abstraction I was exploring in that RFC > could remove some of the special permission memory tribal knowledge > that is lurking in in the cross-arch module.c. I wonder if you have any > thoughts on something like that? The normal modules proved the hardest. >
On Mon, Apr 18, 2022 at 05:44:19PM -0700, Luis Chamberlain wrote: > On Mon, Apr 18, 2022 at 01:06:36PM +0300, Mike Rapoport wrote: > > Hi, > > > > On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote: > > > > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > > > Maybe I am missing something, but I really don't think this is ready > > > > for prime-time. We should effectively disable it all, and have people > > > > think through it a lot more. > > > > > > This has been discussed on lwn.net: https://lwn.net/Articles/883454/. > > > AFAICT, the biggest concern is whether reserving minimal 2MB for BPF > > > programs is a good trade-off for memory usage. This is again my fault > > > not to state the motivation clearly: the primary gain comes from less > > > page table fragmentation and thus better iTLB efficiency. > > > > Reserving 2MB pages for BPF programs will indeed reduce the fragmentation, > > but OTOH it will reduce memory utilization. If for large systems this may > > not be an issue, on smaller machines trading off memory for iTLB > > performance may be not that obvious. > > So the current optimization at best should be a kconfig option? Maybe not and it'll be fine on smaller systems, but from what I see the bpf_prog_pack implementation didn't consider them. And if we move the caches from BPF to vmalloc or page allocator that would be much less of an issue. > > I believe that "allocate huge page and split it to basic pages to hand out > > to users" concept should be implemented at page allocator level and I > > posted and RFC for this a while ago: > > > > https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/ > > Neat, so although eBPF is a big user, are there some use cases outside > that immediately benefit? Anything that uses set_memory APIs could benefit from this. Except eBPF and other module_alloc() users, there is secretmem that also fractures the direct map and actually that was my initial use case for these patches. Another possible use-case can be protection of page tables with PKS: https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgecombe@intel.com/ Vlastimil also mentioned that SEV-SNP could use such caching mechanism, but I don't know the details. > LUis
Hi, On Tue, Apr 19, 2022 at 05:36:45AM +0000, Song Liu wrote: > Hi Mike, Luis, and Rick, > > Thanks for sharing your work and findings in the space. I didn't > realize we were looking at the same set of problems. > > > On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > > > On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote: > >>> There are use-cases that require 4K pages with non-default > >>> permissions in > >>> the direct map and the pages not necessarily should be executable. > >>> There > >>> were several suggestions to implement caches of 4K pages backed by > >>> 2M > >>> pages. > >> > >> Even if we just focus on the executable side of the story... there > >> may > >> be users who can share this too. > >> > >> I've gone down memory lane now at least down to year 2005 in kprobes > >> to see why the heck module_alloc() was used. At first glance there > >> are > >> some old comments about being within the 2 GiB text kernel range... > >> But > >> some old tribal knowledge is still lost. The real hints come from > >> kprobe work > >> since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of > >> line > >> - take2"), so that the "For the %rip-relative displacement fixups to > >> be > >> doable"... but this got me wondering, would other users who *do* want > >> similar funcionality benefit from a cache. If the space is limited > >> then > >> using a cache makes sense. Specially if architectures tend to require > >> hacks for some of this to all work. > > > > Yea, that was my understanding. X86 modules have to be linked within > > 2GB of the kernel text, also eBPF x86 JIT generates code that expects > > to be within 2GB of the kernel text. > > > > > > I think of two types of caches we could have: caches of unmapped pages > > on the direct map and caches of virtual memory mappings. Caches of > > pages on the direct map reduce breakage of the large pages (and is > > somewhat x86 specific problem). Caches of virtual memory mappings > > reduce shootdowns, and are also required to share huge pages. I'll plug > > my old RFC, where I tried to work towards enabling both: > > > > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ > > > > Since then Mike has taken a lot further the direct map cache piece. > > These are really interesting work. With this landed, we won't need > the bpf_prog_pack work at all (I think). OTOH, this looks like a > long term project, as some of the work in bpf_prog_pack took quite > some time to discuss/debate, and that was just a subset of the > whole thing. I'd say that bpf_prog_pack was a cure for symptoms and this project tries to address more general problem. But you are right, it'll take some time and won't land in 5.19. > I really like the two types of cache concept. But there are some > details I cannot figure out about them: After some discussions we decided to try moving the caching of large pages to the page allocator and see if the second cache will be needed at all. But I've got distracted after posting the RFC and that work didn't have real progress since then. > 1. Is "caches of unmapped pages on direct map" (cache #1) > sufficient to fix all direct map fragmentation? IIUC, pages in > the cache may still be used by other allocation (with some > memory pressure). If the system runs for long enough, there > may be a lot of direct map fragmentation. Is this right? 
If the system runs long enough, it may run out of high-order free pages regardless of the way the caches are implemented. Then we either fail the allocation because it is impossible to refill the cache with large pages or fall back to 4k pages and fragment direct map. I don't see how can we avoid direct map fragmentation entirely and still be able to allocate memory for users of set_memory APIs. > 2. If we have "cache of virtual memory mappings" (cache #2), do we > still need cache #1? I know cache #2 alone may waste some > memory, but I still think 2MB within noise for modern systems. I presume that by cache #1 you mean the cache in the page allocator. In that case cache #2 is probably not needed at all, because the cache at page allocator level will be used by vmalloc() and friends to provide what Rick called "permissioned allocations". > Thanks, > Song
On Tue, Apr 19, 2022 at 11:42 AM Mike Rapoport <rppt@kernel.org> wrote: > > I'd say that bpf_prog_pack was a cure for symptoms and this project tries > to address more general problem. > But you are right, it'll take some time and won't land in 5.19. Just to update people: I've just applied Song's [1/4] patch, which means that the whole current hugepage vmalloc thing is effectively disabled (because nothing opts in). And I suspect that will be the status for 5.18, unless somebody comes up with some very strong arguments for (re-)starting using huge pages. Linus
On Tue, Apr 19, 2022 at 01:56:03AM +0000, Edgecombe, Rick P wrote: > Yea, that was my understanding. X86 modules have to be linked within > 2GB of the kernel text, also eBPF x86 JIT generates code that expects > to be within 2GB of the kernel text. And kprobes / live patching / ftrace. Another architectural fun fact, powerpc book3s/32 requires executability to be set per 256 Mbytes segments. Some architectures like this one will want to also optimize how they use the module alloc area. Even though today the use cases might be limited, we don't exactly know how much memory a target device has a well, and so treating memory failures for "special memory" request as regular memory failures seems a bit odd, and users could get confused. For instance slapping on extra memory on a system won't resolve any issues if the limit for a special type of memory is already hit. Very likely not a problem at all today, given how small modules / eBPF jit programs are / etc, but conceptually it would seem wrong to just say -ENOMEM when in fact it's a special type of required memory which cannot be allocated and the issue cannot possibly be fixed. I don't think we have an option but to use -ENOMEM but at least hinting of the special failure would have seem desirable. Do we have other type of architectural limitations for "special memory" other than executable? Do we have *new* types of special memory we should consider which might be similar / limited in nature? And can / could / should these architectural limitations hopefully be disappear in newer CPUs? I see vmalloc_pks() as you pointed out [0] . Anything else? > I think of two types of caches we could have: caches of unmapped pages > on the direct map and caches of virtual memory mappings. Caches of > pages on the direct map reduce breakage of the large pages (and is > somewhat x86 specific problem). Caches of virtual memory mappings > reduce shootdowns, and are also required to share huge pages. I'll plug > my old RFC, where I tried to work towards enabling both: > > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ > > Since then Mike has taken a lot further the direct map cache piece. > > Yea, probably a lot of JIT's are way smaller than a page, but there is > also hopefully some performance benefit of reduced ITLB pressure and > TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps > its own cache of a page for putting very small trampolines. The reason I looked into *why* module_alloc() was used was particularly because it seemed a bit odd to have such ITLB enhancements for such a niche use case and we couldn't have desired this elsewhere before. > > Then, since it seems since the vmalloc area was not initialized, > > wouldn't that break the old JIT spray fixes, refer to commit > > 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying > > attacks")? > > Hmm, yea it might be a way to get around the ebpf jit rlimit. The > allocator could just text_poke() invalid instructions on "free" of the > jit. > > > > > Is that sort of work not needed anymore? If in doubt I at least made > > the > > old proof of concept JIT spray stuff compile on recent kernels [0], > > but > > I haven't tried out your patches yet. If this is not needed anymore, > > why not? > > IIRC this got addressed in two ways, randomizing of the jit offset > inside the vmalloc allocation, and "constant blinding", such that the > specific attack of inserting unaligned instructions as immediate > instruction data did not work. 
Neither of those mitigations seem > unworkable with a large page caching allocator. Got it, but was it *also* considerd in the fixes posted recently? > > The collection of tribal knowedge around these sorts of things would > > be > > good to not loose and if we can share, even better. > > Totally agree here. I think the abstraction I was exploring in that RFC > could remove some of the special permission memory tribal knowledge > that is lurking in in the cross-arch module.c. I wonder if you have any > thoughts on something like that? The normal modules proved the hardest. Yeah modules will be harder now with the new ARCH_WANTS_MODULES_DATA_IN_VMALLOC which Christophe Leroy added (queued in my modules-next). At a quick glance it seems like an API in the right direction, but you just need more architecture folks other than the usual x86 suspects to review. Perhaps time for a new spin? [0] https://lore.kernel.org/lkml/20201009201410.3209180-2-ira.weiny@intel.com/ Luis
On Tue, 2022-04-19 at 14:24 -0700, Luis Chamberlain wrote: > On Tue, Apr 19, 2022 at 01:56:03AM +0000, Edgecombe, Rick P wrote: > > Yea, that was my understanding. X86 modules have to be linked > > within > > 2GB of the kernel text, also eBPF x86 JIT generates code that > > expects > > to be within 2GB of the kernel text. > > And kprobes / live patching / ftrace. > > Another architectural fun fact, powerpc book3s/32 requires > executability > to be set per 256 Mbytes segments. Some architectures like this one > will want to also optimize how they use the module alloc area. > > Even though today the use cases might be limited, we don't exactly > know > how much memory a target device has a well, and so treating memory > failures for "special memory" request as regular memory failures > seems > a bit odd, and users could get confused. For instance slapping on > extra memory on a system won't resolve any issues if the limit for a > special type of memory is already hit. Very likely not a problem at > all today, > given how small modules / eBPF jit programs are / etc, but > conceptually it > would seem wrong to just say -ENOMEM when in fact it's a special type > of > required memory which cannot be allocated and the issue cannot > possibly be > fixed. I don't think we have an option but to use -ENOMEM but at > least > hinting of the special failure would have seem desirable. ENOMEM doesn't always mean out of physical memory though right? Could be hitting some other memory limit. Not sure where this discussion of the error code is coming from though. As far as the problem of eating a whole 2MB on small systems, this makes sense to me to worry about. Erratas limit what we can do here with swapping page sizes on the fly. Probably a sensible solution would be to decide whether to try based on system properties like boot memory size. Even without 2MB pages though, there are still improvements from these types of changes. > > Do we have other type of architectural limitations for "special > memory" > other than executable? Do we have *new* types of special memory we > should consider which might be similar / limited in nature? And can / > could / > should these architectural limitations hopefully be disappear in > newer CPUs? > I see vmalloc_pks() as you pointed out [0] . Anything else? Hmm, shadow stack permission memory could pop up in vmalloc someday. Not sure what you mean by architectural limitations. The relative addressing? If so, those other usages shouldn't be restricted by that. > > > I think of two types of caches we could have: caches of unmapped > > pages > > on the direct map and caches of virtual memory mappings. Caches of > > pages on the direct map reduce breakage of the large pages (and is > > somewhat x86 specific problem). Caches of virtual memory mappings > > reduce shootdowns, and are also required to share huge pages. I'll > > plug > > my old RFC, where I tried to work towards enabling both: > > > > > > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ > > > > Since then Mike has taken a lot further the direct map cache piece. > > > > Yea, probably a lot of JIT's are way smaller than a page, but there > > is > > also hopefully some performance benefit of reduced ITLB pressure > > and > > TLB shootdowns. I think kprobes/ftrace (or at least one of them) > > keeps > > its own cache of a page for putting very small trampolines. 
> > The reason I looked into *why* module_alloc() was used was > particularly > because it seemed a bit odd to have such ITLB enhancements for such > a niche use case and we couldn't have desired this elsewhere before. I think in general it is the only cross-arch way to get an allocation in the arch's executable compatible area (which some need as you say). module_alloc() is probably misnamed at this point with so many users that are not modules. > > > > Then, since it seems since the vmalloc area was not initialized, > > > wouldn't that break the old JIT spray fixes, refer to commit > > > 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against > > > spraying > > > attacks")? > > > > Hmm, yea it might be a way to get around the ebpf jit rlimit. The > > allocator could just text_poke() invalid instructions on "free" of > > the > > jit. > > > > > > > > Is that sort of work not needed anymore? If in doubt I at least > > > made > > > the > > > old proof of concept JIT spray stuff compile on recent kernels > > > [0], > > > but > > > I haven't tried out your patches yet. If this is not needed > > > anymore, > > > why not? > > > > IIRC this got addressed in two ways, randomizing of the jit offset > > inside the vmalloc allocation, and "constant blinding", such that > > the > > specific attack of inserting unaligned instructions as immediate > > instruction data did not work. Neither of those mitigations seem > > unworkable with a large page caching allocator. > > Got it, but was it *also* considerd in the fixes posted recently? I didn't read any discussion about it. But if it wasn't clear, I'm just an onlooker on bpf_prog_pack. I didn't see it until it was already upstream. Maybe Song can say. > > > > The collection of tribal knowedge around these sorts of things > > > would > > > be > > > good to not loose and if we can share, even better. > > > > Totally agree here. I think the abstraction I was exploring in that > > RFC > > could remove some of the special permission memory tribal knowledge > > that is lurking in in the cross-arch module.c. I wonder if you have > > any > > thoughts on something like that? The normal modules proved the > > hardest. > > Yeah modules will be harder now with the new > ARCH_WANTS_MODULES_DATA_IN_VMALLOC > which Christophe Leroy added (queued in my modules-next). Part of that work was separating out each module into 4 allocations, so it might make it easier. > At a quick > glance it seems like an API in the right direction, but you just need > more architecture folks other than the usual x86 suspects to review. > > Perhaps time for a new spin? I would very much like to, but I am currently too busy with another project. As such, I am mostly just trying to contribute ideas and my personal collection of the hidden knowledge. It sounded like from Song, someone might want to tackle it before I can get back to it. And, yes other arch's review would be critical to making sure it actually is a better interface and not just another one. > > [0] > https://lore.kernel.org/lkml/20201009201410.3209180-2-ira.weiny@intel.com/ > > Luis
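The 2 GiB constraint Rick and Luis describe is visible in the x86-64 allocator itself; a simplified sketch of module_alloc() follows (the KASLR load offset, extra VM_* flags and KASAN/kmemleak handling of the real function are omitted):

```c
/*
 * Simplified sketch of x86-64 module_alloc(): allocations are confined to the
 * dedicated modules area, which the memory layout places within 2 GiB of the
 * kernel text so 32-bit RIP-relative references from modules and JITed code
 * still resolve.
 */
void *module_alloc(unsigned long size)
{
	if (PAGE_ALIGN(size) > MODULES_LEN)
		return NULL;

	return __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR,
				    MODULES_END, GFP_KERNEL, PAGE_KERNEL,
				    0, NUMA_NO_NODE,
				    __builtin_return_address(0));
}
```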
On Tue, Apr 19, 2022 at 12:20:39PM -0700, Linus Torvalds wrote: > On Tue, Apr 19, 2022 at 11:42 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > I'd say that bpf_prog_pack was a cure for symptoms and this project tries > > to address more general problem. > > But you are right, it'll take some time and won't land in 5.19. > > Just to update people: I've just applied Song's [1/4] patch, which > means that the whole current hugepage vmalloc thing is effectively > disabled (because nothing opts in). > > And I suspect that will be the status for 5.18, unless somebody comes > up with some very strong arguments for (re-)starting using huge pages. Here is the quote from Song's cover letter for bpf_prog_pack series: Most BPF programs are small, but they consume a page each. For systems with busy traffic and many BPF programs, this could also add significant pressure to instruction TLB. High iTLB pressure usually causes slow down for the whole system, which includes visible performance degradation for production workloads. The last sentence is the key. We've added this feature not because of bpf programs themselves. So calling this feature an optimization is not quite correct. The number of bpf programs on the production server doesn't matter. The programs come and go all the time. That is the key here. The 4k module_alloc() plus set_memory_ro/x done by the JIT break down huge pages and increase TLB pressure on the kernel code. That creates visible performance degradation for normal user space workloads that are not doing anything bpf related. mm folks can fill in the details here. My understanding it's something to do with identity mapping. So we're not trying to improve bpf performance. We're trying to make sure that bpf program load/unload doesn't affect the speed of the kernel. Generalizing bpf_prog_alloc to modules would be nice, but it's not clear what benefits such optimization might have. It's orthogonal here. So I argue that all 4 Song's fixes are necessary in 5.18. We need an additional zeroing patch too, of course, to make sure huge page doesn't have garbage at alloc time and it's cleaned after prog is unloaded. Regarding JIT spraying and other concerns. Short answer: nothing changed. JIT spraying was mitigated with start address randomization and invalid instruction padding. Both features are still present. Constant blinding is also fully functional. Any kind of generalization of bpf_prog_pack into general mm feature would be nice, but it cannot be done as opportunistic cache. We need a guarantee that bpf prog/unload won't recreate the issue with kernel performance degradation. I suspect we would need bpf_prog_pack in the current form for foreseeable future.
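A hedged illustration of the per-program pattern Alexei is describing (legacy_jit_image is an illustrative name; the calls are the real APIs): each program gets its own pages, and every permission flip also updates the direct-map alias of those pages, splitting huge direct-map entries and degrading kernel TLB behaviour system-wide.

```c
#include <linux/moduleloader.h>
#include <linux/set_memory.h>
#include <linux/vmalloc.h>

/* Illustrative sketch of the old per-program pattern that splits the direct map. */
static void *legacy_jit_image(unsigned int size, unsigned int npages)
{
	void *hdr = module_alloc(size);		/* private pages per program */

	if (!hdr)
		return NULL;
	set_vm_flush_reset_perms(hdr);
	set_memory_ro((unsigned long)hdr, npages);	/* CPA also splits the direct-map alias */
	set_memory_x((unsigned long)hdr, npages);
	return hdr;
}
```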
On Tue, Apr 19, 2022 at 7:03 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > Here is the quote from Song's cover letter for bpf_prog_pack series: I care about performance as much as the next person, but I care about correctness too. That large-page code was a disaster, and was buggy and broken. And even with those four patches, it's still broken. End result: there's no way that thing gets re-enabled without the correctness being in place. At a minimum, to re-enable it, it needs (a) that zeroing and (b) actual numbers on real loads (not some artificial benchmark). Because without (a) there's no way in hell I'll enable it. And without (b), "performance" isn't actually an argument. Linus
On Tue 2022-04-19 14:24:38, Luis Chamberlain wrote: > On Tue, Apr 19, 2022 at 01:56:03AM +0000, Edgecombe, Rick P wrote: > > Yea, that was my understanding. X86 modules have to be linked within > > 2GB of the kernel text, also eBPF x86 JIT generates code that expects > > to be within 2GB of the kernel text. > > And kprobes / live patching / ftrace. > > Another architectural fun fact, powerpc book3s/32 requires executability > to be set per 256 Mbytes segments. Some architectures like this one > will want to also optimize how they use the module alloc area. > > Even though today the use cases might be limited, we don't exactly know > how much memory a target device has a well, and so treating memory > failures for "special memory" request as regular memory failures seems > a bit odd, and users could get confused. For instance slapping on > extra memory on a system won't resolve any issues if the limit for a > special type of memory is already hit. Very likely not a problem at all today, > given how small modules / eBPF jit programs are / etc, but conceptually it > would seem wrong to just say -ENOMEM when in fact it's a special type of > required memory which cannot be allocated and the issue cannot possibly be > fixed. I don't think we have an option but to use -ENOMEM but at least > hinting of the special failure would have seem desirable. I am not mm expert but I think that this is common problem. There are many types of "special memory". And mm provides many details via procfs, e.g. /proc/meminfo, /proc/slabinfo, /proc/vmstat. Best Regards, Petr
Hi Linus, > On Apr 19, 2022, at 7:18 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, Apr 19, 2022 at 7:03 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> >> Here is the quote from Song's cover letter for bpf_prog_pack series: > > I care about performance as much as the next person, but I care about > correctness too. > > That large-page code was a disaster, and was buggy and broken. > > And even with those four patches, it's still broken. > > End result: there's no way that thigh gets re-enabled without the > correctness being in place. > > At a minimum, to re-enable it, it needs (a) that zeroing and (b) > actual numbers on real loads. (not some artificial benchmark). > > Because without (a) there's no way in hell I'll enable it. > > And without (b), "performance" isn't actually an argument. I will send patch to do (a) later this week. For (b), we have seen direct map fragmentation causing visible performance drop for our major services. This is the shadow production benchmark, so it is not possible to run it out of our data centers. Tracing showed that BPF program was the top trigger of these direct map splits. Thanks, Song
On Wed, Apr 20, 2022 at 02:42:37PM +0000, Song Liu wrote: > For (b), we have seen direct map fragmentation causing visible > performance drop for our major services. This is the shadow > production benchmark, so it is not possible to run it out of > our data centers. Tracing showed that BPF program was the top > trigger of these direct map splits. It's often not easy to reproduce issues like these, but I've run into that before for other proof-of-concept issues, and the solution has been a Linux selftest. For instance, "multithreaded" bombing of kmod can be triggered with lib/test_kmod.c and tools/testing/selftests/kmod/kmod.sh. Would designing a selftest to abuse the eBPF JIT be a possible way to reproduce the issue? Luis
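A userspace sketch of the kind of selftest being suggested: churn many trivial programs through the JIT via the raw bpf(2) syscall (kept raw here so the sketch is not tied to a particular libbpf version; it assumes the JIT is enabled, e.g. net.core.bpf_jit_enable=1, and sufficient privileges to load programs):

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int load_trivial_prog(void)
{
	/* "r0 = 0; exit" - the smallest program the verifier will accept. */
	struct bpf_insn insns[] = {
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
		{ .code = BPF_JMP | BPF_EXIT },
	};
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.insns = (uintptr_t)insns;
	attr.insn_cnt = 2;
	attr.license = (uintptr_t)"GPL";

	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}

int main(void)
{
	/* Load and unload many tiny programs to churn the JIT allocator. */
	for (int i = 0; i < 100000; i++) {
		int fd = load_trivial_prog();

		if (fd >= 0)
			close(fd);
	}
	return 0;
}
```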
Excerpts from Christoph Hellwig's message of April 16, 2022 3:08 pm: > On Fri, Apr 15, 2022 at 12:05:42PM -0700, Luis Chamberlain wrote: >> Looks good except for that I think this should just wait for v5.19. The >> fixes are so large I can't see why this needs to be rushed in other than >> the first assumptions of the optimizations had some flaws addressed here. > > Patches 1 and 2 are bug fixes for regressions caused by using huge page > backed vmalloc by default. So I think we do need it for 5.18. No, the huge vmap patch should just be reverted because that caused the regression, rather than adding another hack on top of it. All the breakage is in arch/x86/, it doesn't make sense to change this code and APIs outside x86 to work around it. And once they are fixed these shouldn't be needed. Thanks, Nick
Excerpts from Linus Torvalds's message of April 20, 2022 5:20 am: > On Tue, Apr 19, 2022 at 11:42 AM Mike Rapoport <rppt@kernel.org> wrote: >> >> I'd say that bpf_prog_pack was a cure for symptoms and this project tries >> to address more general problem. >> But you are right, it'll take some time and won't land in 5.19. > > Just to update people: I've just applied Song's [1/4] patch, which > means that the whole current hugepage vmalloc thing is effectively > disabled (because nothing opts in). > > And I suspect that will be the status for 5.18, unless somebody comes > up with some very strong arguments for (re-)starting using huge pages. Why not just revert fac54e2bfb5b ? Thanks, Nick
On Wed, Apr 20, 2022 at 8:25 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > Why not just revert fac54e2bfb5b ? That would be stupid, with no sane way forward. The fact is, HUGE_VMALLOC was badly misdesigned, and enabling it on x86 only ended up showing the problems. It wasn't fac54e2bfb5b that was the fundamental issue. It was the whole "oh, we should never have done it that way to begin with". The whole initial notion that HAVE_ARCH_HUGE_VMALLOC means that there must be no PAGE_SIZE pte assumptions was simply broken. There were actual real cases that had those assumptions, and the whole "let's just change vmalloc behavior by default and then people who don't like it can opt out" was just fundamentally a really bad idea. Power had that random "oh, we don't want to do this for module_alloc", which you had a comment about "more testing" for. And s390 had a case of hardware limitations where it didn't work for some cases. And then enabling it on x86 turned up more issues. So yes, commit fac54e2bfb5b _exposed_ things to a much larger audience. But all it made clear was that your original notion of "let's change behavior and randomly disable it as things turn up" was just broken. Including "small" details like the fact that apparently VM_FLUSH_RESET_PERMS didn't work correctly any more for this, which caused issues for bpf, and that [PATCH 4/4]. And yes, there was a half-arsed comment ("may require extra work") to that effect in the powerpc __module_alloc() function, but it had been left to others to notice separately. So no. We're not going back to that completely broken model. The largepage thing needs to be opt-in, and needs a lot more care. Linus
On Wed, Apr 20, 2022 at 10:48 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The largepage thing needs to be opt-in, and needs a lot more care. Side note: part of the opt-in really should be about the performance impact. It clearly can be quite noticeable, as outlined by that powerpc case in commit 8abddd968a30 ("powerpc/64s/radix: Enable huge vmalloc mappings"), but it presumably is some _particular_ case that actually matters. But it's equally clearly not the module code/data case, since __module_alloc() explicitly disables largepages on powerpc. At a guess, it's one or more of the large hash-table allocations. And it would actually be interesting to hear *which* one. From the 'git diff' workload, I'd expect it to be the dentry lookup hash table - I can't think of anything else that would be vmalloc'ed that would be remotely interesting - but who knows. So I think the whole "opt in" isn't _purely_ about the "oh, random cases are broken for odd reasons, so let's not enable it by default". I think it would actually be good to literally mark the cases that matter (and have the performance numbers for those cases). Linus
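To put rough numbers on that guess (illustrative only; the real dentry hash size scales with system memory and can be overridden on the kernel command line), the win is simply in how many TLB entries it takes to cover a multi-megabyte table:

#include <stdio.h>

int main(void)
{
	/* assume a 32 MB dentry hash table, purely for illustration */
	unsigned long bytes = 32UL << 20;

	printf("4K PTE mappings: %lu TLB entries\n", bytes >> 12);
	printf("2M PMD mappings: %lu TLB entries\n", bytes >> 21);
	return 0;
}

With 4K pages that table needs 8192 translations; with 2M pages it needs 16, which is why a hash-walk-heavy workload like 'git diff' can notice the difference.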
> On Apr 20, 2022, at 7:42 AM, Song Liu <songliubraving@fb.com> wrote: > > Hi Linus, > >> On Apr 19, 2022, at 7:18 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> On Tue, Apr 19, 2022 at 7:03 PM Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >>> >>> Here is the quote from Song's cover letter for bpf_prog_pack series: >> >> I care about performance as much as the next person, but I care about >> correctness too. >> >> That large-page code was a disaster, and was buggy and broken. >> >> And even with those four patches, it's still broken. >> >> End result: there's no way that thing gets re-enabled without the >> correctness being in place. >> >> At a minimum, to re-enable it, it needs (a) that zeroing and (b) >> actual numbers on real loads. (not some artificial benchmark). >> >> Because without (a) there's no way in hell I'll enable it. >> >> And without (b), "performance" isn't actually an argument. > > I will send a patch to do (a) later this week. > > For (b), we have seen direct map fragmentation causing visible > performance drop for our major services. This is the shadow > production benchmark, so it is not possible to run it out of > our data centers. Tracing showed that BPF program was the top > trigger of these direct map splits. Attached is the patch for (a). I also sent it to the mailing list. Thanks, Song
Excerpts from Linus Torvalds's message of April 21, 2022 3:48 pm: > On Wed, Apr 20, 2022 at 8:25 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Why not just revert fac54e2bfb5b ? > > That would be stupid, with no sane way forward. > > The fact is, HUGE_VMALLOC was badly misdesigned, and enabling it on > x86 only ended up showing the problems. > > It wasn't fac54e2bfb5b that was the fundamental issue. It was the > whole "oh, we should never have done it that way to begin with". > > The whole initial notion that HAVE_ARCH_HUGE_VMALLOC means that there > must be no PAGE_SIZE pte assumptions was simply broken. It didn't have that requirement so much as required it to be accounted for if the arch enabled it. > There were > actual real cases that had those assumptions, and the whole "let's > just change vmalloc behavior by default and then people who don't like > it can opt out" was just fundamentally a really bad idea. > > Power had that random "oh, we don't want to do this for module_alloc", > which you had a comment about "more testing" for. > > And s390 had a case of hardware limitations where it didn't work for some cases. > > And then enabling it on x86 turned up more issues. Those were (AFAIKS) all in arch code though. The patch was the fundamental issue for x86 because it had bugs. I don't quite see what your objection is to power and s390's working implementations. Some parts of the arch code could not cope with hue PTEs so they used small. Switching the API around to expect non-arch code to know whether or not it can use huge mappings is much worse. How is alloc_large_system_hash expected to know whether it may use huge pages on any given half-broken arch like x86? It's the same like we have huge iomap for a long time. No driver should be expect to have to understand that. > So yes, commit fac54e2bfb5b _exposed_ things to a much larger > audience. But all it just made clear was that your original notion of > "let's change behavior and randomly disable it as things turn up" was > just broken. > > Including "small" details like the fact that apparently > VM_FLUSH_RESET_PERMS didn't work correctly any more for this, which > caused issues for bpf, and that [PATCH 4/4]. Which is another arch detail. > And yes, there was a > half-arsed comment ("may require extra work") to that effect in the > powerpc __module_alloc() function, but it had been left to others to > notice separately. It had a comment in arch/Kconfig about it. Combing through the details of every arch is left to others who choose to opt-in though. > So no. We're not going back to that completely broken model. The > lagepage thing needs to be opt-in, and needs a lot more care. I don't think it should be opt-in at the caller level (at least not outside arch/). As I said earlier maybe we end up finding fragmentation to be a problem that can't be solved with simple heuristics tweaking so we could think about adding something to give size/speed hint tradeoff, but as for "can this caller use huge vmap backed memory", it should not be something drivers or core code has to think about. Thanks, Nick
Excerpts from Linus Torvalds's message of April 21, 2022 4:02 pm: > On Wed, Apr 20, 2022 at 10:48 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> The lagepage thing needs to be opt-in, and needs a lot more care. > > Side note: part of the opt-in really should be about the performance impact. > > It clearly can be quite noticeable, as outlined by that powerpc case > in commit 8abddd968a30 ("powerpc/64s/radix: Enable huge vmalloc > mappings"), but it presumably is some _particular_ case that actually > matters. > > But it's equalyl clearly not the module code/data case, since > __module_alloc() explicitly disables largepages on powerpc. > > At a guess, it's one or more of the large hash-table allocations. The changelog is explicit it is the vfs hashes. > And it would actually be interesting to hear *which*one*. From the > 'git diff' workload, I'd expect it to be the dentry lookup hash table > - I can't think of anything else that would be vmalloc'ed that would > be remotely interesting - but who knows. I didn't measure dentry/inode separately but it should mostly (~entirely?) be the dentry hash, yes. > So I think the whole "opt in" isn't _purely_ about the "oh, random > cases are broken for odd reasons, so let's not enable it by default". The whole concept is totally broken upstream now though. Core code absolutely can not mark any allocation as able to use huge pages because x86 is in some crazy half-working state. Can we use hugepage dentry cache with x86 with hibernation? With BPF? Who knows. > I think it would actually be good to literally mark the cases that > matter (and have the performance numbers for those cases). As per previous comment, not for correctness but possibly to help guide some heuristic. I don't see it being too big a deal though, a multi-MB vmalloc that can use hugepages probably wants to, quite small downside (fragmentation being about the only one, but there aren't a vast number of such allocations in the kernel to have been noticed as yet). Thanks, Nick
Excerpts from Linus Torvalds's message of April 21, 2022 3:48 pm: > On Wed, Apr 20, 2022 at 8:25 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Why not just revert fac54e2bfb5b ? > > That would be stupid, with no sane way forward. Oh I missed this comment. Now I'm completely confused. Reverting the patch which caused the breakage *is* the sane way forward. Your tree is not broken after that so you're done. And if x86 wanted to select HUGE_VMALLOC in future then it can do so after fixing the issues it has with it, so that's that the sane way forward for that. What you have now is what's insane. HAVE_ARCH_HUGE_VMALLOC now means "you can ask for huge pages but it might crash in undocumented arch-specific circumstances so good luck". Thanks, Nick
On Thu, Apr 21, 2022 at 1:57 AM Nicholas Piggin <npiggin@gmail.com> wrote: > > Those were (AFAIKS) all in arch code though. No Nick, they really weren't. The bpf issue with VM_FLUSH_RESET_PERMS means that all your arguments are invalid, because this affected non-architecture code. So the bpf case had two independent issues: one was just bpf doing a really bad job at making sure the executable mapping was sanely initialized. But the other was an actual bug in that hugepage case for vmalloc. And that bug was an issue on power too. So your "this is purely an x86 issue" argument is simply wrong. Because I'm very much looking at that power code that says "oh, __module_alloc() needs more work". Notice? Can these be fixed? Yes. But they can't be fixed by saying "oh, let's disable it on x86". Although it's probably true that at that point, some of the issues would no longer be nearly as noticeable. Linus
On Thu, 2022-04-21 at 18:57 +1000, Nicholas Piggin wrote: > Those were (AFAIKS) all in arch code though. The patch was the > fundamental issue for x86 because it had bugs. This wasn't root caused to arch code: "BUG: Bad page state in process systemd-udevd" https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ In fact it wasn't root caused at all. But on the surface it seemed like it didn't have to do with virtual page size assumptions. I wonder if it might have to do with the vmalloc huge pages using compound pages, then some caller doing vmalloc_to_page() and getting surprised with what they could get away with in the struct page. But, regardless there was an assumption, not proven, that there was some lurking cross-arch issue that could show up for any vmalloc huge page user. There is another good reason to opt-in to the current vmalloc huge page size implementation - vmalloc will round up to huge page size once the size exceeds the huge page size. Only the callers can know if the allocation is worth using extra memory for.
On Thu, Apr 21, 2022 at 8:47 AM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > I wonder if it > might have to do with the vmalloc huge pages using compound pages, then > some caller doing vmalloc_to_page() and getting surprised with what > they could get away with in the struct page. Very likely. We have 100+ users of vmalloc_to_page() in random drivers, and the gpu code does show up on that list. And is very much another case of "it's always been broken, but enabling it on x86 made the breakage actually show up in real life". Linus
Excerpts from Linus Torvalds's message of April 22, 2022 1:44 am: > On Thu, Apr 21, 2022 at 1:57 AM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Those were (AFAIKS) all in arch code though. > > No Nick, they really weren't. > > The bpf issue with VM_FLUSH_RESET_PERMS means that all your arguments > are invalid, because this affected non-architecture code. VM_FLUSH_RESET_PERMS was because bpf uses the arch module allocation code which was not capable of dealing with huge pages in the arch specific direct map manipulation stuff was unable to deal with it. An x86 bug. > So the bpf case had two independent issues: one was just bpf doing a > really bad job at making sure the executable mapping was sanely > initialized. > > But the other was an actual bug in that hugepage case for vmalloc. > > And that bug was an issue on power too. I missed it, which bug was that? > > So your "this is purely an x86 issue" argument is simply wrong. > Because I'm very much looking at that power code that says "oh, > __module_alloc() needs more work". > > Notice? No I don't notice. More work to support huge allocations for executable mappings, sure. But the arch's implementation explicitly does not support that yet. That doesn't make huge vmalloc broken! Ridiculous. It works fine. > > Can these be fixed? Yes. But they can't be fixed by saying "oh, let's > disable it on x86". You did just effectively disable it on x86 though. And why can't it be reverted on x86 until it's fixed on x86?? > Although it's probably true that at that point, some of the issues > would no longer be nearly as noticeable. There really aren't all these "issues" you're imagining. They aren't noticable now, on power or s390, because they have non-buggy HAVE_ARCH_HUGE_VMALLOC implementations. If you're really going to insist on this will you apply this to fix (some of) the performance regressions it introduced? Thanks, Nick diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6e5b4488a0c5..b555f17e84d5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8919,7 +8919,10 @@ void *__init alloc_large_system_hash(const char *tablename, table = memblock_alloc_raw(size, SMP_CACHE_BYTES); } else if (get_order(size) >= MAX_ORDER || hashdist) { - table = __vmalloc(size, gfp_flags); + if (IS_ENABLED(CONFIG_PPC) || IS_ENABLED(CONFIG_S390)) + table = vmalloc_huge(size, gfp_flags); + else + table = __vmalloc(size, gfp_flags); virt = true; if (table) huge = is_vm_area_hugepages(table);
Excerpts from Linus Torvalds's message of April 22, 2022 2:15 am: > On Thu, Apr 21, 2022 at 8:47 AM Edgecombe, Rick P > <rick.p.edgecombe@intel.com> wrote: >> >> I wonder if it >> might have to do with the vmalloc huge pages using compound pages, then >> some caller doing vmalloc_to_page() and getting surprised with what >> they could get away with in the struct page. > > Very likely. We have 100+ users of vmalloc_to_page() in random > drivers, and the gpu code does show up on that list. > > And is very much another case of "it's always been broken, but > enabling it on x86 made the breakage actually show up in real life". Okay that looks like a valid breakage. *Possibly* fb_deferred_io_fault() using pages vmalloced to screen_buffer? Or a couple of the gpu drivers are playing with page->mapping as well, not sure if they're vmalloced. But the fix is this (untested at the moment). It's not some fundamental reason why any driver should care about allocation size, it's a simple bug in my code that missed that case. The whole point of the design is that it's transparent to callers! Thanks, Nick diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e163372d3967..70933f4ed069 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, if (nr != nr_pages_request) break; } - } else - /* - * Compound pages required for remap_vmalloc_page if - * high-order pages. - */ - gfp |= __GFP_COMP; + } /* High-order pages or fallback path if "bulk" fails. */ @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, page = alloc_pages_node(nid, gfp, order); if (unlikely(!page)) break; + /* + * Higher order allocations must be able to be treated as + * indepdenent small pages by callers (as they can with + * small page allocs). + */ + if (order) + split_page(page, order); /* * Careful, we allocate and map page-order pages, but
On Thu, Apr 21, 2022 at 4:30 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > VM_FLUSH_RESET_PERMS was because bpf uses the arch module allocation > code which was not capable of dealing with huge pages in the arch > specific direct map manipulation stuff was unable to deal with it. > An x86 bug. .. and a power one? The only thing really special in __module_alloc() on power is that same VM_FLUSH_RESET_PERMS. Why had you otherwise disabled it there on powerpc too? > > And that bug was an issue on power too. > > I missed it, which bug was that? See above. At least it's very strongly implied by how the powerpc __module_alloc() function also used VM_FLUSH_RESET_PERMS | VM_NO_HUGE_VMAP, because there isn't much else going on. > No I don't notice. More work to support huge allocations for > executable mappings, sure. But the arch's implementation explicitly > does not support that yet. That doesn't make huge vmalloc broken! > Ridiculous. It works fine. There are several other reports of problems that weren't related to permissions (at least not obviously so). You were pointed at one of them in this thread: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ and yes, it also happened on x86-64, but my point this whole time has been that x86-64 gets A *LOT* MORE TEST COVERAGE. See the difference? The fact that it has workedc for you on powerpc doesn't mean that it's fine on powerpc. It only means that powerpc gets about one thousandth of the test coverage that x86-64 gets. > You did just effectively disable it on x86 though. I disabled it *EVERYWHERE*. What is so hard to understand about that? Why are you so convinced this is about x86? It's not. > There really aren't all these "issues" you're imagining. They > aren't noticable now, on power or s390, because they have > non-buggy HAVE_ARCH_HUGE_VMALLOC implementations. So I really think you've not tested it. How many of those powerpc or s390 machines do you think test drivers that do vmalloc_to_page() and then do something with that 'struct page *'? Seriously. Why are you so convinced that "oh, any vmalloc() can be converted to large pages"? I really think the only reason you think that is because it ran on machines that basically have almost no drivers in use, and are all very homogenous, and just didn't happen to hit the bugs. IOW, I think you are probably entirely right that x86 has its own set of excitement (the bpf thread has this thing about how x86 does RO differently from other architectures), and there are likely x86 specific bugs *TOO*. But let's just pick a random driver that uses vmalloc (literally random - I just grepped for it and started looking at it): drivers/infiniband/hw/qib/qib_file_ops.c and it unquestionably does absolutely disgusting things, and if looking at the code makes you go "nobody should do that", then I won't disagree all that much. But as an example of what it does, it basically does things like this: rcd->subctxt_uregbase = vmalloc_user(...); and then you can mmap it in user space in mmap_kvaddr(): addr = rcd->subctxt_uregbase; size = PAGE_SIZE * subctxt_cnt; ... 
vma->vm_pgoff = (unsigned long) addr >> PAGE_SHIFT; vma->vm_ops = &qib_file_vm_ops; and then the page fault routine is: static const struct vm_operations_struct qib_file_vm_ops = { .fault = qib_file_vma_fault, }; and that function qib_file_vma_fault() does things like this: page = vmalloc_to_page((void *)(vmf->pgoff << PAGE_SHIFT)); if (!page) return VM_FAULT_SIGBUS; get_page(page); vmf->page = page; return 0; and let me just claim (a) I bet you have never EVER tested this kind of insane code on powerpc (b) do you think this will work great if vmalloc() allocates large pages? Can you now see what I'm saying? Can you now admit that the whole "nmake vmalloc() do large pages without explicit opt-in" was a HORRIBLE MISTAKE. > If you're really going to insist on this will you apply this to fix > (some of) the performance regressions it introduced? No. That patch is disgusting and there is no excuse for applying something crap like that. What I applied was the first in a series of patches that do it sanely. That whole "a sane way forward" thing. See https://lore.kernel.org/all/20220415164413.2727220-3-song@kernel.org/ for [PATCH 2/4] in the series for this particular issue. But I'm not applying anything else than the "disable this mess" before we have more discussion and consensus. And dammit, you had better just admit that this wasn't some x86-only thing. Powerpc and s390 were BROKEN GARBAGE AND JUST DIDN'T HAVE THE TEST COVERAGE. Seriously. And the thing is, your opt-out approach was just POINTLESS. The actual cases that are performance-critical are likely in the single digits. It's probably just that big-hash case and maybe a *couple* of other cases (ie the bpf jit really wants to use it). So opt-in is clearly the correct thing to do. Do you now understand why I applied that "users must opt-in" patch? Do you now understand that this is not some "x86" thing? Linus
Excerpts from Linus Torvalds's message of April 22, 2022 10:49 am: > On Thu, Apr 21, 2022 at 4:30 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> VM_FLUSH_RESET_PERMS was because bpf uses the arch module allocation >> code which was not capable of dealing with huge pages in the arch >> specific direct map manipulation stuff was unable to deal with it. >> An x86 bug. > > .. and a power one? The only thing really special in __module_alloc() > on power is that same VM_FLUSH_RESET_PERMS. > > Why had you otherwise disabled it there on powerpc too? Right, it is for that (and possibly other similar ro->nx stuff for module loading and instruction patching etc). That's why we don't do huge vmalloc for executable mappings (yet). I don't understand why you're trying to claim that limitation is a bug in power or warrants disabling the whole thing for data mappings as well. > >> > And that bug was an issue on power too. >> >> I missed it, which bug was that? > > See above. At least it's very strongly implied by how the powerpc > __module_alloc() function also used > > VM_FLUSH_RESET_PERMS | VM_NO_HUGE_VMAP, > > because there isn't much else going on. So that's not a bug though. To be clear, VM_FLUSH_RESET_PERMS is clearly never to be used in driver code, it's a low-level arch specific detail. So the flag itself does not have to be usable by any old random code. > >> No I don't notice. More work to support huge allocations for >> executable mappings, sure. But the arch's implementation explicitly >> does not support that yet. That doesn't make huge vmalloc broken! >> Ridiculous. It works fine. > > There are several other reports of problems that weren't related to > permissions (at least not obviously so). > > You were pointed at one of them in this thread: > > https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ Right, I agree see the patch I posted. > > and yes, it also happened on x86-64, but my point this whole time has > been that x86-64 gets A *LOT* MORE TEST COVERAGE. > > See the difference? The fact that it has workedc for you on powerpc > doesn't mean that it's fine on powerpc. It only means that powerpc > gets about one thousandth of the test coverage that x86-64 gets. Surely it's not unreasonable that there will be *some* bugs when coverage is expanded. I'm not saying the code I write is perfect and bug-free when I say it's fine on power, I'm saying that in concept it is fine. > >> You did just effectively disable it on x86 though. > > I disabled it *EVERYWHERE*. > > What is so hard to understand about that? > > Why are you so convinced this is about x86? Well the x86 bugs in its direct mapping stuff seem to be. If they go away when not using huge mappings for modules then fine, I'm just not really across all the issues but it sounded like there were more than just this module case. > > It's not. > >> There really aren't all these "issues" you're imagining. They >> aren't noticable now, on power or s390, because they have >> non-buggy HAVE_ARCH_HUGE_VMALLOC implementations. > > So I really think you've not tested it. > > How many of those powerpc or s390 machines do you think test drivers > that do vmalloc_to_page() and then do something with that 'struct page > *'? Probably quite a few, but the intersection of those with ones that also allocate huge vmalloc *and* modify struct page fields to trip that bug was probably empty. > Seriously. Why are you so convinced that "oh, any vmalloc() can be > converted to large pages"? Because it can be transparent. 
The bug was (stupidly) using compound pages when it should have just used split higher order pages. But the whole idea of it is that it's supposed to be transparent to drivers. I don't get why you're freaking out about vmalloc_to_page(), a huge mapping is pretty much just like a small mapping except the pages are contiguous and the page tables are a bit different. > I really think the only reason you think that is because it ran on > machines that basically have almost no drivers in use, and are all > very homogenous, and just didn't happen to hit the bugs. > > IOW, I think you are probably entirely right that x86 has its own set > of excitement (the bpf thread has this thing about how x86 does RO > differently from other architectures), and there are likely x86 > specific bugs *TOO*. > > But let's just pick a random driver that uses vmalloc (literally > random - I just grepped for it and started looking at it): > > drivers/infiniband/hw/qib/qib_file_ops.c > > and it unquestionably does absolutely disgusting things, and if > looking at the code makes you go "nobody should do that", then I won't > disagree all that much. > > But as an example of what it does, it basically does things like this: > > rcd->subctxt_uregbase = vmalloc_user(...); > > and then you can mmap it in user space in mmap_kvaddr(): > > addr = rcd->subctxt_uregbase; > size = PAGE_SIZE * subctxt_cnt; > ... > vma->vm_pgoff = (unsigned long) addr >> PAGE_SHIFT; > vma->vm_ops = &qib_file_vm_ops; > > and then the page fault routine is: > > static const struct vm_operations_struct qib_file_vm_ops = { > .fault = qib_file_vma_fault, > }; > > and that function qib_file_vma_fault() does things like this: > > page = vmalloc_to_page((void *)(vmf->pgoff << PAGE_SHIFT)); > if (!page) > return VM_FAULT_SIGBUS; > > get_page(page); > vmf->page = page; > > return 0; > > and let me just claim > > (a) I bet you have never EVER tested this kind of insane code on powerpc > > (b) do you think this will work great if vmalloc() allocates large pages? Yes of course it will, with said 2-line bugfix. > Can you now see what I'm saying? > > Can you now admit that the whole "nmake vmalloc() do large pages > without explicit opt-in" was a HORRIBLE MISTAKE. No, can you admit that vmalloc_to_page is not some some crazy problem? The drivers are doing fairly horrible things sure, but *nothing* can look at the PTEs without going through accessors (except arch code obviously, which has to adapt before enabling). vmalloc_to_page() just gives you a struct page! Why would a driver care that it now happens to be contiguous with the next one or not?? >> If you're really going to insist on this will you apply this to fix >> (some of) the performance regressions it introduced? > > No. > > That patch is disgusting and there is no excuse for applying something > crap like that. > > What I applied was the first in a series of patches that do it sanely. > That whole "a sane way forward" thing. > > See > > https://lore.kernel.org/all/20220415164413.2727220-3-song@kernel.org/ > > for [PATCH 2/4] in the series for this particular issue. I was being facetious. The problem is you can't do ^ because x86 is buggy. > > But I'm not applying anything else than the "disable this mess" before > we have more discussion and consensus. > > And dammit, you had better just admit that this wasn't some x86-only thing. > > Powerpc and s390 were BROKEN GARBAGE AND JUST DIDN'T HAVE THE TEST COVERAGE. Everybody writes bugs, nobody has 100% test coverage. I missed something, big whoop. 
That doesn't invalidate the entire concept of the feature. > > Seriously. > > And the thing is, your opt-out approach was just POINTLESS. The actual > cases that are performance-critical are likely in the single digits. > > It's probably just that big-hash case and maybe a *couple* of other > cases (ie the bpf jit really wants to use it). > > So opt-in is clearly the correct thing to do. > > Do you now understand why I applied that "users must opt-in" patch? No, opt-in for correctness is stupid and unecessary because callers shouldn't have to know about it. That's the whole point of the huge vmalloc is that it's supposed to be transparent to non-arch code. That drm screen buffer thing involved in the bug -- why wouldn't that want to use huge pages? > > Do you now understand that this is not some "x86" thing? > If the only bug is that compound page thing and x86 is otherwise fine then sure, and patch 1 was unnecessary to begin with. Thanks, Nick
On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote: > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index e163372d3967..70933f4ed069 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, > if (nr != nr_pages_request) > break; > } > - } else > - /* > - * Compound pages required for remap_vmalloc_page if > - * high-order pages. > - */ > - gfp |= __GFP_COMP; > + } > > /* High-order pages or fallback path if "bulk" fails. */ > > @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, > page = alloc_pages_node(nid, gfp, order); > if (unlikely(!page)) > break; > + /* > + * Higher order allocations must be able to be > treated as > + * indepdenent small pages by callers (as they can > with > + * small page allocs). > + */ > + if (order) > + split_page(page, order); > > /* > * Careful, we allocate and map page-order pages, but FWIW, I like this direction. I think it needs to free them differently though? Since currently assumes they are high order pages in that path. I also wonder if we wouldn't need vm_struct->page_order anymore, and all the places that would percolates out to. Basically all the places where it iterates through vm_struct->pages with page_order stepping. Besides fixing the bisected issue (hopefully), it also more cleanly separates the mapping from the backing allocation logic. And then since all the pages are 4k (from the page allocator perspective), it would be easier to support non-huge page aligned sizes. i.e. not use up a whole additional 2MB page if you only need 4k more of allocation size.
On Thu, Apr 21, 2022 at 6:51 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > > See > > > > https://lore.kernel.org/all/20220415164413.2727220-3-song@kernel.org/ > > > > for [PATCH 2/4] in the series for this particular issue. > > I was being facetious. The problem is you can't do ^ because x86 is > buggy. No, we probably *can* do that PATCH 2/4. I suspect x86 really isn't that buggy. The bugs are elsewhere (including other vmalloc_huge() uses). Really. Why can't you just admit that the major bug was in the hugepage code itself? You claim: > Because it can be transparent. The bug was (stupidly) using compound > pages when it should have just used split higher order pages. but we're in -rc3 for 5.18, and you seem to be entirely ignoring the fact that that stupid bug has been around for a *YEAR* now. Guess what? It was reported within *days* of the code having been enabled on x86. But for about a year, youv'e been convinced that powerpc is fine, because nobody ever reported it. And you *still* try to make this about how it's some "x86 bug", despite that bug not having been x86-specific AT ALL. Nick, please take a long look at yourself in the mirror. And stop this whole mindless "it's x86". The *ONLY* thing x86-64 did was to show that the code that had been enabled on powerpc for a year had gotten almost no testing there. And don't bother mentioning s390. It got even less coverage there. So exactly *because* bugs were uncovered in days by x86 enabling this, I'm not rushing to re-enable it until I think it's gone through more thinking and testing. And in particular, I really *really* want to limit the fallout. For example, your "two-liner fix" is not at all obvious. That broken code case used to have a comment that remap_vmalloc_page() required compound pages, and you just removed that whole thing as if it didn't matter, and split the page. (I also think the comment meant 'vmap_pages_range()', but whatever). And the thing is, I'm not entirely convinced that comment was wrong and could just be ignored. The freeing code in __vunmap() will do int i, step = 1U << page_order; for (i = 0; i < area->nr_pages; i += step) { struct page *page = area->pages[i]; BUG_ON(!page); mod_memcg_page_state(page, MEMCG_VMALLOC, -step); __free_pages(page, page_order); which now looks VERY VERY wrong. You've split the pages, they may be used as individual pages (possibly by other things), and then you now at freeing time treat them as a single compound page after all.. So your "trivial two-liner" that tried to fix a bug that has been there for a year now, itself seems quite questionable. Maybe it works, maybe it doesn't. My bet is "it doesn't". And guess what? I bet it worked just fine in your testing on powerpc, because you probably didn't actually have any real huge-page vmalloc cases except for those filesystem big-hash cases that never get free'd. So that "this code was completely buggy for a year on powerpc" never seemed to teach you anything about the code. And again - none of this is at all x86-specific. NOT AT ALL. So how about you admit you were wrong to begin with. That hugepage code needs more care before we re-enable it. Your two-liner wasn't so obvious after all, was it? I really think we're much safer saying "hugepage mappings only matter for a couple of things, and those things will *not* do sub-page games, so it's simple and safe". .. and that requires that opt-in model. Because your "it's transparent" argument has never ever actually been true, now has it? Linus
On Thu, Apr 21, 2022 at 7:29 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > FWIW, I like this direction. I think it needs to free them differently > though? Very much so. > Besides fixing the bisected issue (hopefully), it also more cleanly > separates the mapping from the backing allocation logic. And then since > all the pages are 4k (from the page allocator perspective), it would be > easier to support non-huge page aligned sizes. i.e. not use up a whole > additional 2MB page if you only need 4k more of allocation size. I don't disagree, but I think the real problem is that the whole "oen page_order per vmalloc() area" itself is a bit broken. For example, AMD already does this "automatic TLB size" thing for when you have multiple contiguous PTE entries (shades of the old alpha "page size hint" thing, except it's automatic and doesn't have explicit hints). And I'm hoping Intel will do something similar in the future. End result? It would actually be really good to just map contiguous pages, but it doesn't have anything to do with the 2MB PMD size. And there's no "fixed order" needed either. If you have mapping that is 17 pages in size, it would still be good to allocate them as a block of 16 pages ("page_order = 4") and as a single page, because just laying them out in the page tables that way will already allow AMD to use a 64kB TLB entry for that 16-page block. But it would also work to just do the allocations as a set of 8, 4, 4 and 1. But the whole "one page order for one vmalloc" means that doesn't work very well. Where I disagree (violently) with Nick is his contention that (a) this is x86-specific and (b) this is somehow trivial to fix. Let's face it - the current code is broken. I think the sub-page issue is not entirely trivial, and the current design isn't even very good for it. But the *easy* cases are the ones that simply don't care - the ones that powerpc has actually been testing. So for 5.18, I think it's quite likely reasonable to re-enable large-page vmalloc for the easy case (ie those big hash tables). Re-enabling it *all*, considering how broken it has been, and how little testing it has clearly gotten? And potentially not enabling it on x86 because x86 is so much better at showing issues? That's not what I want to do. If the code is so broken that it can't be used on x86, then it's too broken to be enabled on powerpc and s390 too. Never mind that those architectures might have so limited use that they never realized how broken they were.. Linus
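A minimal sketch of the kind of decomposition Linus describes (illustrative userspace code, not a proposal for vmalloc itself): greedily split the request into descending power-of-two blocks, so a 17-page mapping becomes 16 + 1 and the contiguous runs are still laid out in a way hardware TLB coalescing can exploit, without tying the whole area to one fixed order.

#include <stdio.h>

/* Greedy power-of-two decomposition of an allocation request. */
static void decompose(unsigned long nr_pages)
{
	printf("%lu pages ->", nr_pages);
	while (nr_pages) {
		/* largest 2^order block that still fits */
		unsigned int order = 63 - __builtin_clzl(nr_pages);

		printf(" %lu", 1UL << order);
		nr_pages -= 1UL << order;
	}
	printf("\n");
}

int main(void)
{
	decompose(17);	/* -> 16 1 */
	decompose(513);	/* -> 512 1, i.e. one 2MB block plus a single 4K page */
	return 0;
}

The point of the sketch is only the shape of the allocation pattern; per-block bookkeeping in vm_struct, rather than a single page_order, is the part that would actually need design work.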
Excerpts from Linus Torvalds's message of April 22, 2022 12:31 pm: > On Thu, Apr 21, 2022 at 6:51 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> > See >> > >> > https://lore.kernel.org/all/20220415164413.2727220-3-song@kernel.org/ >> > >> > for [PATCH 2/4] in the series for this particular issue. >> >> I was being facetious. The problem is you can't do ^ because x86 is >> buggy. > > No, we probably *can* do that PATCH 2/4. I suspect x86 really isn't > that buggy. The bugs are elsewhere (including other vmalloc_huge() > uses). I thought Rick seemed to be concerned about other issues but if not then great, I would happily admit to being mistaken. > > Really. Why can't you just admit that the major bug was in the > hugepage code itself? Uh, I did. But not the concept. You admit you were going stupid and freaking about about vmalloc_to_page which is really not some insurmountable problem. > > You claim: > >> Because it can be transparent. The bug was (stupidly) using compound >> pages when it should have just used split higher order pages. > > but we're in -rc3 for 5.18, and you seem to be entirely ignoring the > fact that that stupid bug has been around for a *YEAR* now. I'm not ignoring that at all. Again I completely agree that path is unlikely to have been tested (or at least not reported) on powerpc or s390. > Guess what? It was reported within *days* of the code having been > enabled on x86. Is that supposed to be a gotcha? If I was cc'ed on it I might have fixed it a month ago, but why would it be unusual for major new coverage of such a big feature to expose a bug? > But for about a year, youv'e been convinced that powerpc is fine, > because nobody ever reported it. > > And you *still* try to make this about how it's some "x86 bug", > despite that bug not having been x86-specific AT ALL. I think we're talking past one another. I never said that bug was x86 specific, it was only just brought to my attention now and I (presume that I) fixed it within 10 minutes of looking at it. That's not what I understood to be the only x86 problem though, but again as I said I'll admit I misunderstood that if it's not the case. [snip] > For example, your "two-liner fix" is not at all obvious. It is pretty obvious. Something wanted to use a tail page and ran afoul of the compound page thing. Splitting the page just gives us an array of conguous small pages. > That broken code case used to have a comment that remap_vmalloc_page() > required compound pages, and you just removed that whole thing as if > it didn't matter, and split the page. The new comment is a superset, I agree the old comment is under-done. remap_vmalloc_page() wanted to get_page the sub pages and without compound or split huge page they have a ref count of 0 so that blows up. split page just makes the pages behave exactly as they would if allocated individually (i.e., the small page case). Yes a stupid thinko on my part when I added that, but that's all it is, it's an obvious thing. > (I also think the comment meant 'vmap_pages_range()', but whatever). remap_vmalloc_range I think. But clearly the same issue applies to any caller which may use the struct pages for stuff. > And the thing is, I'm not entirely convinced that comment was wrong > and could just be ignored. 
The freeing code in __vunmap() will do > > int i, step = 1U << page_order; > > for (i = 0; i < area->nr_pages; i += step) { > struct page *page = area->pages[i]; > > BUG_ON(!page); > mod_memcg_page_state(page, MEMCG_VMALLOC, -step); > __free_pages(page, page_order); > > which now looks VERY VERY wrong. > > You've split the pages, they may be used as individual pages (possibly > by other things), and then you now at freeing time treat them as a > single compound page after all.. Oh yeah I'll take a look at the freeing side too, as I said totally untested but that's obviously the issue. > > So your "trivial two-liner" that tried to fix a bug that has been > there for a year now, itself seems quite questionable. > > Maybe it works, maybe it doesn't. My bet is "it doesn't". > > And guess what? I bet it worked just fine in your testing on powerpc, > because you probably didn't actually have any real huge-page vmalloc > cases except for those filesystem big-hash cases that never get > free'd. > > So that "this code was completely buggy for a year on powerpc" never > seemed to teach you anything about the code. > > And again - none of this is at all x86-specific. NOT AT ALL. > > So how about you admit you were wrong to begin with. > > That hugepage code needs more care before we re-enable it. > > Your two-liner wasn't so obvious after all, was it? > > I really think we're much safer saying "hugepage mappings only matter > for a couple of things, and those things will *not* do sub-page games, > so it's simple and safe". > > .. and that requires that opt-in model. > > Because your "it's transparent" argument has never ever actually been > true, now has it? Of course it has. You admit you didn't understand the design and freaked out about vmalloc_to_page when drivers playing games with struct page is a non issue. I'm not a moron, clearly if it wasn't intended to be transparent to callers it could not have just been enabled unconditionally. What's so difficult to you about the concept that the arch code is really the only thing that has to know about its page table arrangement? You're now fixating on this bug as though that invalidates the whole concept of huge vmalloc being transparent to callers. That's being disingenous though, it's just a bug. That's all it is. You merge fixes for hundreds of bugs everywhere in the tree every day. You're using it as a crutch because you can't admit you were wrong about it being transparent to drivers and thinking it has to be opt-in. Anyway blah whatever. I admit to everything, I did it, I'm bad I did a huge booboo a year ago, huge vmalloc is stupid, x86 is wonderful, everything else is garbage, Linus is great. Okay? I'll polish up the patch (*and* test it with a specifically written test case just for your), and then we can see if it solves that drm bug and we can go from there. Thanks, Nick
Excerpts from Edgecombe, Rick P's message of April 22, 2022 12:29 pm: > On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote: >> diff --git a/mm/vmalloc.c b/mm/vmalloc.c >> index e163372d3967..70933f4ed069 100644 >> --- a/mm/vmalloc.c >> +++ b/mm/vmalloc.c >> @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >> if (nr != nr_pages_request) >> break; >> } >> - } else >> - /* >> - * Compound pages required for remap_vmalloc_page if >> - * high-order pages. >> - */ >> - gfp |= __GFP_COMP; >> + } >> >> /* High-order pages or fallback path if "bulk" fails. */ >> >> @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >> page = alloc_pages_node(nid, gfp, order); >> if (unlikely(!page)) >> break; >> + /* >> + * Higher order allocations must be able to be >> treated as >> + * indepdenent small pages by callers (as they can >> with >> + * small page allocs). >> + */ >> + if (order) >> + split_page(page, order); >> >> /* >> * Careful, we allocate and map page-order pages, but > > FWIW, I like this direction. I think it needs to free them differently > though? Since currently assumes they are high order pages in that path. Yeah I got a bit excited there, but fairly sure that's the bug. I'll do a proper patch. > I also wonder if we wouldn't need vm_struct->page_order anymore, and > all the places that would percolates out to. Basically all the places > where it iterates through vm_struct->pages with page_order stepping. > > Besides fixing the bisected issue (hopefully), it also more cleanly > separates the mapping from the backing allocation logic. And then since > all the pages are 4k (from the page allocator perspective), it would be > easier to support non-huge page aligned sizes. i.e. not use up a whole > additional 2MB page if you only need 4k more of allocation size. Somewhat. I'm not sure about freeing pages back for general use when they have page tables pointing to them. That will be a whole different kettle of fish I suspect. Honestly I don't think fragmentation has proven to be that bad of a problem yet that we'd want to do something like that yet. But there are other reasons to decople them in general. Some ppc32 I think had a different page size that was not a huge page which it could use to map vmalloc memory, for example. In any case I have no objections to making that part of it more flexible. Thanks, Nick
Excerpts from Edgecombe, Rick P's message of April 22, 2022 12:29 pm: > On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote: >> diff --git a/mm/vmalloc.c b/mm/vmalloc.c >> index e163372d3967..70933f4ed069 100644 >> --- a/mm/vmalloc.c >> +++ b/mm/vmalloc.c >> @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >> if (nr != nr_pages_request) >> break; >> } >> - } else >> - /* >> - * Compound pages required for remap_vmalloc_page if >> - * high-order pages. >> - */ >> - gfp |= __GFP_COMP; >> + } >> >> /* High-order pages or fallback path if "bulk" fails. */ >> >> @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >> page = alloc_pages_node(nid, gfp, order); >> if (unlikely(!page)) >> break; >> + /* >> + * Higher order allocations must be able to be >> treated as >> + * indepdenent small pages by callers (as they can >> with >> + * small page allocs). >> + */ >> + if (order) >> + split_page(page, order); >> >> /* >> * Careful, we allocate and map page-order pages, but > > FWIW, I like this direction. I think it needs to free them differently > though? Since currently assumes they are high order pages in that path. > I also wonder if we wouldn't need vm_struct->page_order anymore, and > all the places that would percolates out to. Basically all the places > where it iterates through vm_struct->pages with page_order stepping. To respond to this, I do like having the huge page indicator in /proc/vmallocinfo. Which is really the only use of it. I might just keep it in for now. Thanks, Nick
Excerpts from Nicholas Piggin's message of April 22, 2022 1:08 pm: > Excerpts from Edgecombe, Rick P's message of April 22, 2022 12:29 pm: >> On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote: >>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c >>> index e163372d3967..70933f4ed069 100644 >>> --- a/mm/vmalloc.c >>> +++ b/mm/vmalloc.c >>> @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >>> if (nr != nr_pages_request) >>> break; >>> } >>> - } else >>> - /* >>> - * Compound pages required for remap_vmalloc_page if >>> - * high-order pages. >>> - */ >>> - gfp |= __GFP_COMP; >>> + } >>> >>> /* High-order pages or fallback path if "bulk" fails. */ >>> >>> @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >>> page = alloc_pages_node(nid, gfp, order); >>> if (unlikely(!page)) >>> break; >>> + /* >>> + * Higher order allocations must be able to be >>> treated as >>> + * indepdenent small pages by callers (as they can >>> with >>> + * small page allocs). >>> + */ >>> + if (order) >>> + split_page(page, order); >>> >>> /* >>> * Careful, we allocate and map page-order pages, but >> >> FWIW, I like this direction. I think it needs to free them differently >> though? Since currently assumes they are high order pages in that path. > > Yeah I got a bit excited there, but fairly sure that's the bug. > I'll do a proper patch. So here's the patch on top of the revert. Only tested on a lowly powerpc machine, but it does fix this simple test case that does what the drm driver is obviously doing: size_t sz = PMD_SIZE; void *mem = vmalloc(sz); struct page *p = vmalloc_to_page(mem + PAGE_SIZE*3); p->mapping = NULL; p->index = 0; INIT_LIST_HEAD(&p->lru); vfree(mem); Without the below fix the same exact problem reproduces: BUG: Bad page state in process swapper/0 pfn:00743 page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743 flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff) raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: corrupted mapping in tail page Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810 Call Trace: [c000000002383940] [c0000000006ebb00] dump_stack_lvl+0x74/0xa8 (unreliable) [c000000002383980] [c0000000003dabdc] bad_page+0x12c/0x170 [c000000002383a00] [c0000000003dad08] free_tail_pages_check+0xe8/0x190 [c000000002383a30] [c0000000003dc45c] free_pcp_prepare+0x31c/0x4e0 [c000000002383a90] [c0000000003df9f0] free_unref_page+0x40/0x1b0 [c000000002383ad0] [c0000000003d7fc8] __vunmap+0x1d8/0x420 [c000000002383b70] [c00000000102e0d8] proc_vmalloc_init+0xdc/0x108 [c000000002383bf0] [c000000000011f80] do_one_initcall+0x60/0x2c0 [c000000002383cc0] [c000000001001658] kernel_init_freeable+0x32c/0x3cc [c000000002383da0] [c000000000012564] kernel_init+0x34/0x1a0 [c000000002383e10] [c00000000000ce64] ret_from_kernel_thread+0x5c/0x64 Any other concerns with the fix? Thanks, Nick -- mm/vmalloc: huge vmalloc backing pages should be split rather than compound Huge vmalloc higher-order backing pages were allocated with __GFP_COMP in order to allow the sub-pages to be refcounted by callers such as "remap_vmalloc_page [sic]" (remap_vmalloc_range). However a similar problem exists for other struct page fields callers use, for example fb_deferred_io_fault() takes a vmalloc'ed page and not only refcounts it but uses ->lru, ->mapping, ->index. 
This is not compatible with compound sub-pages. The correct approach is to use split high-order pages for the huge vmalloc backing. These allow callers to treat them in exactly the same way as individually-allocated order-0 pages. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- mm/vmalloc.c | 36 +++++++++++++++++++++--------------- 1 file changed, 21 insertions(+), 15 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 0b17498a34f1..09470361dc03 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2653,15 +2653,18 @@ static void __vunmap(const void *addr, int deallocate_pages) vm_remove_mappings(area, deallocate_pages); if (deallocate_pages) { - unsigned int page_order = vm_area_page_order(area); - int i, step = 1U << page_order; + int i; - for (i = 0; i < area->nr_pages; i += step) { + for (i = 0; i < area->nr_pages; i++) { struct page *page = area->pages[i]; BUG_ON(!page); - mod_memcg_page_state(page, MEMCG_VMALLOC, -step); - __free_pages(page, page_order); + mod_memcg_page_state(page, MEMCG_VMALLOC, -1); + /* + * High-order allocs for huge vmallocs are split, so + * can be freed as an array of order-0 allocations + */ + __free_pages(page, 0); cond_resched(); } atomic_long_sub(area->nr_pages, &nr_vmalloc_pages); @@ -2914,12 +2917,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, if (nr != nr_pages_request) break; } - } else - /* - * Compound pages required for remap_vmalloc_page if - * high-order pages. - */ - gfp |= __GFP_COMP; + } /* High-order pages or fallback path if "bulk" fails. */ @@ -2933,6 +2931,15 @@ vm_area_alloc_pages(gfp_t gfp, int nid, page = alloc_pages_node(nid, gfp, order); if (unlikely(!page)) break; + /* + * Higher order allocations must be able to be treated as + * indepdenent small pages by callers (as they can with + * small-page vmallocs). Some drivers do their own refcounting + * on vmalloc_to_page() pages, some use page->mapping, + * page->lru, etc. + */ + if (order) + split_page(page, order); /* * Careful, we allocate and map page-order pages, but @@ -2992,11 +2999,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, atomic_long_add(area->nr_pages, &nr_vmalloc_pages); if (gfp_mask & __GFP_ACCOUNT) { - int i, step = 1U << page_order; + int i; - for (i = 0; i < area->nr_pages; i += step) - mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, - step); + for (i = 0; i < area->nr_pages; i++) + mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1); } /*
On Thu, 2022-04-21 at 19:47 -0700, Linus Torvalds wrote: > I don't disagree, but I think the real problem is that the whole "oen > page_order per vmalloc() area" itself is a bit broken. Yea. It is the main reason it has to round up to huge page sizes AFAICT. I'd really like to see it use memory a little more efficiently if it is going to be an opt-out thing again. > > For example, AMD already does this "automatic TLB size" thing for > when > you have multiple contiguous PTE entries (shades of the old alpha > "page size hint" thing, except it's automatic and doesn't have > explicit hints). > > And I'm hoping Intel will do something similar in the future. > > End result? It would actually be really good to just map contiguous > pages, but it doesn't have anything to do with the 2MB PMD size. > > And there's no "fixed order" needed either. If you have mapping that > is 17 pages in size, it would still be good to allocate them as a > block of 16 pages ("page_order = 4") and as a single page, because > just laying them out in the page tables that way will already allow > AMD to use a 64kB TLB entry for that 16-page block. > > But it would also work to just do the allocations as a set of 8, 4, 4 > and 1. Hmm, that's neat. > > But the whole "one page order for one vmalloc" means that doesn't > work > very well. > > Where I disagree (violently) with Nick is his contention that (a) > this > is x86-specific and (b) this is somehow trivial to fix. > > Let's face it - the current code is broken. I think the sub-page > issue > is not entirely trivial, and the current design isn't even very good > for it. > > But the *easy* cases are the ones that simply don't care - the ones > that powerpc has actually been testing. > > So for 5.18, I think it's quite likely reasonable to re-enable > large-page vmalloc for the easy case (ie those big hash tables). > > Re-enabling it *all*, considering how broken it has been, and how > little testing it has clearly gotten? And potentially not enabling it > on x86 because x86 is so much better at showing issues? That's not > what I want to do. > > If the code is so broken that it can't be used on x86, then it's too > broken to be enabled on powerpc and s390 too. Never mind that those > architectures might have so limited use that they never realized how > broken they were.. I think there is another cross-arch issue here that we shouldn't lose sight of. There are not enough warnings in the code about the assumptions made on the arch's. The other issues are x86 specific in terms of who gets affected in rc1, but I dug up this prophetic assessment: https://lore.kernel.org/lkml/4488d39f-0698-7bfd-b81c-1e609821818f@intel.com/ That is pretty much what happened. Song came along and, in its current state, took it as a knob that could just be flipped. Seems pretty reasonable that it could happen again. So IMHO, the other general issue is the lack of guard rails or warnings for the next arch that comes along. Probably VM_FLUSH_RESET_PERMS should get some warnings as well. I kind of like the idea in that thread of making functions or configs for arch's to be forced to declare they have specific properties. Does it seem reasonable at this point? Probably not necessary as a 5.18 fix.
On Fri, 2022-04-22 at 14:31 +1000, Nicholas Piggin wrote:
> Any other concerns with the fix?
I had another concern with the vmalloc huge pages. Possibly getting
into paranoid territory here, but it's around the new behavior of
freeing vmalloc page tables. At least on x86 this doesn't happen very
often. AFAICT it all happens during boot or during memory hot unplug.
The risk this entails is racing against all the software page table
walks and walking a freed table. At least on x86, the walks of the
kernel tables are done with no locks, which works because the PTE
updates are atomic and the tables are pretty much never freed. Some of
the kernel page table walks in the fault handler are actually
triggerable from userspace.
So it's in the category of "there is already a bug" or, at most, a little
trouble someone could cause with a lot of effort, at least until PKS
vmalloc memory shows up. Anyway, as long as we are going over all of this
again, I thought it was worth bringing up.
On Fri, 2022-04-22 at 10:10 -0700, Edgecombe, Richard P wrote:
> The risk this entails is racing against all the software page table
> walks and walking a freed table. At least on x86, the walks of the
> kernel tables are done with no locks, which works because the PTE
> updates are atomic and the tables are pretty much never freed. Some of
> the kernel page table walks in the fault handler are actually
> triggerable from userspace.

Argh, please ignore this. I guess that, with interrupts disabled in the
fault handler, the freeing of a page table is forced to synchronize with
the fault handler's walks via the TLB flush IPIs that precede the free.
Sorry for the noise.
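Spelling out the pattern described above, as a hedged sketch rather than code
from the thread (demo_free_kernel_pte_table is a made-up name): the lockless
walker runs with interrupts off, and the freeing side issues an IPI-based
kernel TLB flush before the table page is returned to the allocator, so the
free cannot overtake an in-progress walk.

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <asm/tlbflush.h>

/*
 * Simplified illustration. The walker side is, e.g., the x86 fault handler,
 * which runs with interrupts disabled and therefore cannot be in the middle
 * of a walk at the moment it acknowledges the flush IPI below.
 */
static void demo_free_kernel_pte_table(struct page *pte_page,
				       unsigned long addr)
{
	/*
	 * On x86 this flush is carried out with IPIs to all CPUs, and each
	 * CPU can only take the IPI once it is running with interrupts
	 * enabled again, i.e. after any lockless walk has finished.
	 */
	flush_tlb_kernel_range(addr, addr + PMD_SIZE);

	/* Only now is it safe to return the table page to the allocator. */
	__free_page(pte_page);
}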
[ I see that you posted a new version of the series, but I wasn't cc'd
  on that one, so I'm replying to the old thread instead ]

On Sat, Apr 16, 2022 at 12:55 PM Song Liu <songliubraving@fb.com> wrote:
>
> Patch 2/4 enables huge pages for large hash.

I decided that for 5.18, we want to at least fix the performance
regression on powerpc, so I've applied the 2/4 patch to enable huge
pages for the large hashes.

I also enabled them for kvmalloc(), since that seemed like the one
ObviouslySafe(tm) case of vmalloc use (famous last words, maybe I'll
be informed of somebody who still did odd protection games on the
result, but that really sounds invalid with the whole SLUB component).

I'm not touching the bpf parts. I think that's a 5.19 issue by now,
and since it's new, there's no equivalent performance regression
issue.

               Linus
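As a rough illustration of what enabling huge pages for kvmalloc() with the
opt-in flag amounts to (a sketch under the assumption that the vmalloc
fallback simply passes VM_ALLOW_HUGE_VMAP through __vmalloc_node_range();
demo_kvmalloc_fallback is a made-up name, not the committed diff):

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Hypothetical stand-in for kvmalloc_node()'s vmalloc fallback path. */
static void *demo_kvmalloc_fallback(size_t size, gfp_t flags, int node)
{
	/*
	 * VM_ALLOW_HUGE_VMAP is the new opt-in; the remaining arguments are
	 * the usual vmalloc fallback parameters.
	 */
	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				    flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
				    node, __builtin_return_address(0));
}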
Hi Linus,

> On Apr 24, 2022, at 10:43 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> [ I see that you posted a new version of the series, but I wasn't cc'd
>   on that one, so I'm replying to the old thread instead ]

Thanks for fixing these up, and sorry for messing up the CC list.

>
> On Sat, Apr 16, 2022 at 12:55 PM Song Liu <songliubraving@fb.com> wrote:
>>
>> Patch 2/4 enables huge pages for large hash.
>
> I decided that for 5.18, we want to at least fix the performance
> regression on powerpc, so I've applied the 2/4 patch to enable huge
> pages for the large hashes.
>
> I also enabled them for kvmalloc(), since that seemed like the one
> ObviouslySafe(tm) case of vmalloc use (famous last words, maybe I'll
> be informed of somebody who still did odd protection games on the
> result, but that really sounds invalid with the whole SLUB component).
>
> I'm not touching the bpf parts. I think that's a 5.19 issue by now,
> and since it's new, there's no equivalent performance regression
> issue.

With 5.18-rc4, bpf programs on x86_64 are using 4kB bpf_prog_pack, so
multiple small BPF programs will share a 4kB page. We still need to
initialize each 4kB bpf_prog_pack with illegal instructions, with
something similar to [1]. I will revise it based on Peter's suggestion
[2].

Thanks,
Song

[1] https://lore.kernel.org/all/20220422051813.1989257-2-song@kernel.org/
[2] https://lore.kernel.org/linux-mm/20220422073118.GR2731@worktop.programming.kicks-ass.net/
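For context on initializing each pack with illegal instructions: on x86 the
usual choice is to fill unused JIT space with the single-byte INT3 opcode
(0xcc), so a stray jump into the shared page traps instead of executing
leftover bytes. A minimal sketch under that assumption (demo_pack_init is a
made-up name, not the actual bpf_prog_pack code):

#include <string.h>

#define DEMO_PACK_SIZE	4096
#define X86_INT3	0xcc	/* single-byte trapping instruction */

/*
 * Poison a freshly allocated pack before any program is copied into it, so
 * the unused space between small programs sharing the page traps on
 * execution.
 */
static void demo_pack_init(void *pack)
{
	memset(pack, X86_INT3, DEMO_PACK_SIZE);
}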